
Methods And Systems For Anomaly Detection In A Product

Abstract: This disclosure relates generally to methods and systems for detecting multi-class anomalies present in a manufacturing product. The neural network based conventional methods to solve multi-class anomaly detection are not efficient and accurate, especially in case of high intraclass variation. The 3-channel RGB image and its corresponding single-channel binary ground-truth are passed to the encoder network. The feature maps from the encoder are convolved with a set of attention modules, to implement class-specific self-attention, and then fed into the multi-scale attention network. The refined feature maps from the multi-scale attention network are fed into the logits formation network. The decoder feature maps are also convolved with a set of attention modules and then fed into the logits formation network. The aggregated feature maps are passed to the soft-max activation function to obtain the probability maps, which contain a probability for each defect class and a normal class. [To be published with FIG. 3]


Patent Information

Application #
Filing Date
14 June 2022
Publication Number
50/2023
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th floor, Nariman point, Mumbai 400021, Maharashtra, India

Inventors

1. PRADHAN, Prakhar
Tata Consultancy Services Limited, Innovation Labs, 7th floor, ODC-4, Gopalan Global axis H block KIADB Export Promotion Area, Whitefield, Bangalore 560066, Karnataka, India
2. SHARMA, Hrishikesh
Tata Consultancy Services Limited, Innovation Labs, 7th floor, ODC-4, Gopalan Global axis H block KIADB Export Promotion Area, Whitefield, Bangalore 560066, Karnataka, India

Specification

Description: FORM 2

THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of invention:

METHODS AND SYSTEMS FOR ANOMALY DETECTION IN A PRODUCT

Applicant

Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India

Preamble to the description:
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The disclosure herein generally relates to the field of anomaly detection and more specifically to methods and systems for detecting multi-class anomalies present in a manufacturing product.

BACKGROUND
In the manufacturing sector, due to sporadic physical, deterministic problems in the machinery, localized defects of a specific nature arise on the surface of manufactured products or items. For example, crazing happens due to application of non-optimal stress on the manufacturing material, which may arise out of non-optimal settings of the production control parameters. This is unlike the definition of anomaly as studied in the theoretical context, which encompasses defects where there is a definition of only the normal pattern ('normal' class) and every pattern/non-pattern deviant from the normal pattern is deemed an anomaly. These anomalies need to be detected at the earliest stage to avoid further deterioration in the quality of the product.
Existing methods for vision-based anomaly detection are based on the aforementioned theoretical definition, and are not efficient and accurate enough to account for the specific nature and types of anomalies arising specifically in the manufacturing sector. Also, the neural network based conventional methods to solve multi-class anomaly detection incorporate class-specific attention at an output layer, mainly for visualization purposes and not for optimizing the required performance metrics. Further, the neural network based conventional methods to solve multi-class anomaly detection are not efficient and accurate, especially in case of high intraclass variation of the defects.

SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.
In an aspect, a processor-implemented method for anomaly detection in a product is provided. The method includes the steps of: receiving a plurality of 3-channel RGB images corresponding to each class of a plurality of classes, and a single-channel binary ground-truth image corresponding to each 3-channel RGB image of the plurality of 3-channel RGB images, wherein the plurality of classes comprises one or more anomaly classes and a normal class; forming one or more mini-batches from the plurality of 3-channel RGB images corresponding to each class, based on a predefined mini-batch size, wherein each mini-batch comprises one or more 3-channel RGB images out of the plurality of 3-channel RGB images corresponding to each class; training a multi-class segmentation network, with the one or more 3-channel RGB images present in each mini-batch associated with each class at a time, until the one or more mini-batches associated with the plurality of classes are completely ingested sequentially for a predefined number of training epochs, wherein the multi-class segmentation network comprises an encoder network, one or more encoder attention module sets, a decoder network, one or more decoder attention module sets, a multi-scale attention network, and a logits formation network, and wherein training the multi-class segmentation network with the one or more 3-channel RGB images present in each mini-batch associated with each class comprises: extracting a first set of encoded feature maps of each 3-channel RGB image present in the mini-batch, by passing the corresponding 3-channel RGB image to the encoder network; passing the first set of encoded feature maps of each 3-channel RGB image through the corresponding encoder attention module set connected to the encoder network, to obtain a second set of encoded feature maps of each 3-channel RGB image; extracting a first set of decoded feature maps of each 3-channel RGB image present in the mini-batch, by passing the corresponding first set of encoded feature maps to the decoder network; passing the first set of decoded feature maps of each 3-channel RGB image through the corresponding decoder attention module set connected to the decoder network, to obtain a second set of decoded feature maps of each 3-channel RGB image; passing the second set of encoded feature maps of each 3-channel RGB image to the multi-scale attention network, to obtain a set of scaled feature maps of each 3-channel RGB image; passing the set of scaled feature maps of each 3-channel RGB image and the corresponding second set of decoded feature maps to the logits formation network, to obtain a probability map of each 3-channel RGB image; calculating a value of a multi-class cross-entropy loss function of the multi-class segmentation network, for the one or more 3-channel RGB images present in the mini-batch, using the obtained probability map of each 3-channel RGB image and the corresponding single-channel binary ground-truth image; and updating weights of the multi-class segmentation network, based on the calculated value of the multi-class cross-entropy loss function of the multi-class segmentation network; receiving an input 3-channel RGB image of a product, for which the anomaly is to be detected; passing the input 3-channel RGB image to the trained multi-class segmentation network, to obtain an input probability map; and detecting the presence of the anomaly in the product, based on the input probability map.
In another aspect, a system for anomaly detection in a product is provided. The system includes: a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to: receive a plurality of 3-channel RGB images corresponding to each class of a plurality of classes, and a single-channel binary ground-truth image corresponding to each 3-channel RGB image of the plurality of 3-channel RGB images, wherein the plurality of classes comprises one or more anomaly classes and a normal class; form one or more mini-batches from the plurality of 3-channel RGB images corresponding to each class, based on a predefined mini-batch size, wherein each mini-batch comprises one or more 3-channel RGB images out of the plurality of 3-channel RGB images corresponding to each class; train a multi-class segmentation network, with the one or more 3-channel RGB images present in each mini-batch associated with each class at a time, until the one or more mini-batches associated with the plurality of classes are completely ingested sequentially for a predefined number of training epochs, wherein the multi-class segmentation network comprises an encoder network, one or more encoder attention module sets, a decoder network, one or more decoder attention module sets, a multi-scale attention network, and a logits formation network, and wherein training the multi-class segmentation network with the one or more 3-channel RGB images present in each mini-batch associated with each class comprises: extracting a first set of encoded feature maps of each 3-channel RGB image present in the mini-batch, by passing the corresponding 3-channel RGB image to the encoder network; passing the first set of encoded feature maps of each 3-channel RGB image through the corresponding encoder attention module set connected to the encoder network, to obtain a second set of encoded feature maps of each 3-channel RGB image; extracting a first set of decoded feature maps of each 3-channel RGB image present in the mini-batch, by passing the corresponding first set of encoded feature maps to the decoder network; passing the first set of decoded feature maps of each 3-channel RGB image through the corresponding decoder attention module set connected to the decoder network, to obtain a second set of decoded feature maps of each 3-channel RGB image; passing the second set of encoded feature maps of each 3-channel RGB image to the multi-scale attention network, to obtain a set of scaled feature maps of each 3-channel RGB image; passing the set of scaled feature maps of each 3-channel RGB image and the corresponding second set of decoded feature maps to the logits formation network, to obtain a probability map of each 3-channel RGB image; calculating a value of a loss function of the multi-class segmentation network, for the one or more 3-channel RGB images present in the mini-batch, using the obtained probability map of each 3-channel RGB image and the corresponding single-channel binary ground-truth image; and updating weights of the multi-class segmentation network, based on the calculated value of the loss function of the multi-class segmentation network; receive an input 3-channel RGB image of a product, for which the anomaly is to be detected; pass the input 3-channel RGB image to the trained multi-class segmentation network, to obtain an input probability map; and detect the presence of the anomaly in the product, based on the input probability map.
In yet another aspect, there is provided a computer program product comprising a non-transitory computer readable medium having a computer readable program embodied therein, wherein the computer readable program, when executed on a computing device, causes the computing device to: receive a plurality of 3-channel RGB images corresponding to each class of a plurality of classes, and a single-channel binary ground-truth image corresponding to each 3-channel RGB image of the plurality of 3-channel RGB images, wherein the plurality of classes comprises one or more anomaly classes and a normal class; form one or more mini-batches from the plurality of 3-channel RGB images corresponding to each class, based on a predefined mini-batch size, wherein each mini-batch comprises one or more 3-channel RGB images out of the plurality of 3-channel RGB images corresponding to each class; train a multi-class segmentation network, with the one or more 3-channel RGB images present in each mini-batch associated with each class at a time, until the one or more mini-batches associated with the plurality of classes are completely ingested sequentially for a predefined number of training epochs, wherein the multi-class segmentation network comprises an encoder network, one or more encoder attention module sets, a decoder network, one or more decoder attention module sets, a multi-scale attention network, and a logits formation network, and wherein training the multi-class segmentation network with the one or more 3-channel RGB images present in each mini-batch associated with each class comprises: extracting a first set of encoded feature maps of each 3-channel RGB image present in the mini-batch, by passing the corresponding 3-channel RGB image to the encoder network; passing the first set of encoded feature maps of each 3-channel RGB image through the corresponding encoder attention module set connected to the encoder network, to obtain a second set of encoded feature maps of each 3-channel RGB image; extracting a first set of decoded feature maps of each 3-channel RGB image present in the mini-batch, by passing the corresponding first set of encoded feature maps to the decoder network; passing the first set of decoded feature maps of each 3-channel RGB image through the corresponding decoder attention module set connected to the decoder network, to obtain a second set of decoded feature maps of each 3-channel RGB image; passing the second set of encoded feature maps of each 3-channel RGB image to the multi-scale attention network, to obtain a set of scaled feature maps of each 3-channel RGB image; passing the set of scaled feature maps of each 3-channel RGB image and the corresponding second set of decoded feature maps to the logits formation network, to obtain a probability map of each 3-channel RGB image; calculating a value of a loss function of the multi-class segmentation network, for the one or more 3-channel RGB images present in the mini-batch, using the obtained probability map of each 3-channel RGB image and the corresponding single-channel binary ground-truth image; and updating weights of the multi-class segmentation network, based on the calculated value of the loss function of the multi-class segmentation network; receive an input 3-channel RGB image of a product, for which the anomaly is to be detected; pass the input 3-channel RGB image to the trained multi-class segmentation network, to obtain an input probability map; and detect the presence of the anomaly in the product, based on the input probability map.
In an embodiment, the encoder network comprises a plurality of encoder blocks; the decoder network comprises a plurality of decoder blocks; the one or more encoder attention module sets is connected to the encoder network, wherein each encoder attention module set comprises a plurality of encoder attention modules dedicated to each class of the plurality of classes; and the one or more decoder attention module sets is connected to the decoder network, wherein each decoder attention module set comprises the plurality of decoder attention modules dedicated to each class of the plurality of classes.
In an embodiment, each encoder attention module comprises an encoder attention residual unit, an encoder attention logic gate unit, a first encoder attention convolutional layer, an encoder attention soft-max activation layer, and a second encoder attention convolutional layer, and wherein the encoder attention residual unit comprises a first encoder attention residual unit batch normalization layer, a first encoder attention residual unit ReLU activation layer, a first encoder attention residual unit convolutional layer, a second encoder attention residual unit batch normalization layer, a second encoder attention residual unit ReLU activation layer, a second encoder attention residual unit convolutional layer, and a third encoder attention residual unit convolutional layer; and each decoder attention module comprises a decoder attention residual unit, a decoder attention logic gate unit, a first decoder attention convolutional layer, a decoder attention soft-max activation layer, and a second decoder attention convolutional layer, and wherein the decoder attention residual unit comprises a first decoder attention residual unit batch normalization layer, a first decoder attention residual unit ReLU activation layer, a first decoder attention residual unit convolutional layer, a second decoder attention residual unit batch normalization layer, a second decoder attention residual unit ReLU activation layer, a second decoder attention residual unit convolutional layer, and a third decoder attention residual unit convolutional layer.
In an embodiment, a number of each of (i) the one or more encoder attention module sets connected to the encoder network, and (ii) the one or more decoder attention module sets connected to the decoder network, is equal to a number of the plurality of classes.
In an embodiment, during training the multi-class segmentation network with each of the one or more 3-channel RGB images present in each mini-batch associated with each class, (i) the corresponding encoder attention module set, (ii) the corresponding decoder attention module set, (iii) the multi-scale attention network, (iv) the logits formation network, (v) the encoder network, and (vi) the decoder network, are activated for back-propagation.
In an embodiment, passing the set of scaled feature maps of each 3-channel RGB image and the corresponding second set of decoded feature maps to the logits formation network, to obtain the probability map of each 3-channel RGB image, comprises: merging the set of scaled feature maps of each 3-channel RGB image and the corresponding second set of decoded feature maps through a set of convolutional-deconvolutional layers present in the logits formation network, to obtain a set of aggregated feature maps of each 3-channel RGB image; concatenating the set of aggregated feature maps of each 3-channel RGB image through a logits concatenation layer present in the logits formation network, to obtain a concatenated feature map of each 3-channel RGB image; and passing the concatenated feature map of each 3-channel RGB image, through a convolutional layer and a soft-max activation function present in the logits formation network, to obtain the probability map of each 3-channel RGB image.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 is an exemplary block diagram of a system for anomaly detection in a product, in accordance with some embodiments of the present disclosure.
FIG. 2A through FIG. 2C illustrate exemplary flow diagrams of a processor-implemented method for anomaly detection in a product, in accordance with some embodiments of the present disclosure.
FIG. 3 shows a high-level block diagram of a multi-class segmentation network, in accordance with some embodiments of the present disclosure.
FIGS. 4A and 4B show an exemplary architecture diagram of the multi-class segmentation network, in accordance with some embodiments of the present disclosure.
FIG. 5 shows an exemplary architecture diagram of an attention module, in accordance with some embodiments of the present disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
In the manufacturing sector, there are many applications that require detection of various kinds of defects on the produced lots of products or items. The defects mostly arise due to malfunction of the plant or the machine, sometimes due to faults in machine parts and incorrect controls applied to the machinery, whether manually or automatically. The defects also vary from item to item. For example, bad print quality on fabric is a form of textural defect, while a missing lead from an electrical component is a structural defect. Defects on manufactured items manifest only once in a while. Manufacturing Quality Assurance is an important responsibility of the supplier, and it deals with detection of defects in various produced lots. Timely detection of defects not only helps in fixing the machinery problem in a timely manner, but also helps in segregating the normal produced lots from the faulty lots, and dealing with the faulty lots in a separate way.
The detection of defects can be carried out in various ways. The most popular and cost-effective way is vision-based defect detection. Vision-based defect detection falls under the broader purview of anomaly detection, which also deals with detecting colonies of carcinoma cells in microscopic images and out-of-context objects in various scenes, e.g., a bus on a water surface. The dominant philosophy of vision-based anomaly detection over the last many decades has been that one must model only the normal pattern ('normal' class), and every pattern/non-pattern deviant from the normal pattern is deemed an anomaly, or an anomalous region of the image. Such a philosophy not just lumps all kinds of defects into a single category ('anomaly'), but also critically fails to exploit any specific pattern in the defect class in learning a discriminative model, which is bound to provide better detection performance. Especially in the manufacturing scenario, it is well known that various kinds of defects do have some degree of noisy visual patterns in them.
The neural network based conventional methods to solve multi-class anomaly detection incorporate class-specific attention at the output layer, mainly for visualization purposes and not for optimizing the required performance metrics. The present disclosure solves the technical problems in the art for multi-class anomaly detection by using class-specific attention modules at both an encoder network and a decoder network of a multi-class segmentation network, for obtaining enriched and scaled features to discriminate among various kinds of defects and localize them with robust accuracy, even in case of high intraclass variation of the defects.
Referring now to the drawings, and more particularly to FIG. 1 through FIG. 5, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary systems and/or methods.
FIG. 1 is an exemplary block diagram of a system 100 for anomaly detection in a product, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 includes or is otherwise in communication with one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104. The one or more hardware processors 104, the memory 102, and the I/O interface(s) 106 may be coupled to a system bus 108 or a similar mechanism.
The I/O interface(s) 106 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a plurality of sensor devices, a printer, and the like. Further, the I/O interface(s) 106 may enable the system 100 to communicate with other devices, such as web servers and external databases.
The I/O interface(s) 106 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For this purpose, the I/O interface(s) 106 may include one or more ports for connecting a number of computing systems with one another or to another server computer. Further, the I/O interface(s) 106 may include one or more ports for connecting a number of devices to one another or to another server.
The one or more hardware processors 104 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In the context of the present disclosure, the expressions ‘processors’ and ‘hardware processors’ may be used interchangeably. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, portable computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 102 includes a plurality of modules 102a and a repository 102b for storing data processed, received, and generated by one or more of the plurality of modules 102a. The plurality of modules 102a may include routines, programs, objects, components, data structures, and so on, which perform particular tasks or implement particular abstract data types.
The plurality of modules 102a may include programs or computer-readable instructions or coded instructions that supplement applications or functions performed by the system 100. The plurality of modules 102a may also be used as signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules 102a can be implemented by hardware, by computer-readable instructions executed by the one or more hardware processors 104, or by a combination thereof. In an embodiment, the plurality of modules 102a can include various sub-modules (not shown in FIG. 1). Further, the memory 102 may include information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure.
The repository 102b may include a database or a data engine. Further, the repository 102b amongst other things, may serve as a database or includes a plurality of databases for storing the data that is processed, received, or generated as a result of the execution of the plurality of modules 102a. Although the repository 102b is shown internal to the system 100, it will be noted that, in alternate embodiments, the repository 102b can also be implemented external to the system 100, where the repository 102b may be stored within an external database (not shown in FIG. 1) communicatively coupled to the system 100. The data contained within such external database may be periodically updated. For example, data may be added into the external database and/or existing data may be modified and/or non-useful data may be deleted from the external database. In one example, the data may be stored in an external system, such as a Lightweight Directory Access Protocol (LDAP) directory and a Relational Database Management System (RDBMS). In another embodiment, the data stored in the repository 102b may be distributed between the system 100 and the external database.
Referring to FIG. 2A through FIG. 2C, components and functionalities of the system 100 are described in accordance with an example embodiment of the present disclosure. For example, FIG. 2A through FIG. 2C illustrates exemplary flow diagrams of a processor-implemented method 200 for anomaly detection in a product, in accordance with some embodiments of the present disclosure. Although steps of the method 200 including process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any practical order. Further, some steps may be performed simultaneously, or some steps may be performed alone or independently.
At step 202 of the method 200, the one or more hardware processors 104 of the system 100 are configured to receive a plurality of 3-channel RGB images corresponding to each class of a plurality of classes. The plurality of classes includes one or more anomaly classes and a normal class. The one or more anomaly classes are the defect classes. Each anomaly class or defect class corresponds to one type of defect associated with the product. For example, the anomaly classes or defect classes correspond to cracks (including different levels of cracks), dents (including different amounts of dents), and so on. The normal class corresponds to no anomaly or no defect, in other words a perfect condition or quality. Also, the one or more hardware processors 104 of the system 100 are configured to receive a single-channel binary ground-truth image corresponding to each 3-channel RGB image of the plurality of 3-channel RGB images corresponding to each class.
In an embodiment, the number of the plurality of 3-channel RGB images corresponding to each class may be the same or different, i.e., the number of 3-channel RGB images corresponding to one class need not be the same as the number of 3-channel RGB images corresponding to any other class. In an embodiment, the plurality of 3-channel RGB images corresponding to each class and the single-channel binary ground-truth image corresponding to each 3-channel RGB image may be stored in the repository 102b of the system 100. In an embodiment, the size of each 3-channel RGB image is 256×256 pixels. In an embodiment, the size of each single-channel binary ground-truth image is 256×256 pixels. If the actual size of the image is more than 256×256 pixels, then a patch image or a cropped image of the image portion of interest with 256×256 pixels is created.
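For illustration only, the following is a minimal, non-limiting sketch of such patch cropping; the helper name crop_patch and the border-clamping behaviour are assumptions, not part of the specification:

```python
# Hypothetical helper (not from the specification): crop a 256x256 patch of
# the image portion of interest from a larger image, clamping to the borders.
import numpy as np

def crop_patch(image: np.ndarray, top: int, left: int, size: int = 256) -> np.ndarray:
    """Return a size x size patch from an H x W x C image."""
    h, w = image.shape[:2]
    top = max(0, min(top, h - size))
    left = max(0, min(left, w - size))
    return image[top:top + size, left:left + size]
```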
At step 204 of the method 200, the one or more hardware processors 104 of the system 100 are configured to form one or more mini-batches from the plurality of 3-channel RGB images corresponding to each class, based on a predefined mini-batch size. Each mini-batch includes one or more 3-channel RGB images out of the plurality of 3-channel RGB images corresponding to each class. The predefined mini-batch size is uniform across the plurality of classes.
As the number of the plurality of 3-channel RGB images corresponding to each class may be the same or different, the number of the one or more mini-batches formed for each class also may be the same or different. It is implicit that the number of the one or more 3-channel RGB images present in the last mini-batch may or may not be equal to the predefined mini-batch size, based on the number of remaining samples available. In an embodiment, the predefined mini-batch size is defined based on the resource availability, such as the hardware, graphics processing unit (GPU) capacity, and memory present in the system 100.
For example, if the plurality of 3-channel RGB images received at step 202 of the method corresponds to n classes, then the n classes include 1 normal class and (n-1) defect or anomaly classes. For example, if the predefined mini-batch size is set to B, then B 3-channel RGB images are present in each mini-batch corresponding to each class. Then, for the i-th class, m_i mini-batches of B 3-channel RGB images each are formed.
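For illustration only, a minimal sketch of this mini-batch formation is given below; the helper name form_minibatches is hypothetical, and the shuffling is an assumption consistent with the random per-epoch selection described later:

```python
# Hypothetical sketch: split the images of each class into mini-batches of
# size B; the last mini-batch of a class may be smaller, as noted above.
import math
import random

def form_minibatches(images_by_class: dict, batch_size: int) -> dict:
    batches = {}
    for cls, images in images_by_class.items():
        shuffled = random.sample(images, len(images))  # re-drawn every call
        batches[cls] = [shuffled[i:i + batch_size]
                        for i in range(0, len(shuffled), batch_size)]
        # m_i = ceil(number of images of the i-th class / B) mini-batches
        assert len(batches[cls]) == math.ceil(len(images) / batch_size)
    return batches
```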
At step 206 of the method 200, the one or more hardware processors 104 of the system 100 are configured to train a multi-class segmentation network, with the one or more 3-channel RGB images present in each mini-batch associated with each class at a time. The plurality of 3-channel RGB images for each class and the corresponding single-channel binary ground-truth images received at step 202 of the method 200 form a training dataset to train the multi-class segmentation network. The training dataset is then divided into a number of mini-batches corresponding to each class, as mentioned at step 204 of the method 200.
The one or more mini-batches corresponding to each class are then sequentially ingested into the multi-class segmentation network for the training. More specifically, once the mini-batches corresponding to one class are completed, then the mini-batches corresponding to another class are ingested, and so on, until the mini-batches corresponding to all of the plurality of classes are completed. Once the mini-batches corresponding to all of the plurality of classes are completed, it is termed one training epoch, and the training of the multi-class segmentation network is performed until a predefined number of training epochs is met. In an embodiment, the predefined number of training epochs is 200.
The predefined mini-batch size is uniform across all the training epochs; however, the one or more mini-batches (with 3-channel RGB images) of a specific class in one training epoch need not be the same as the one or more mini-batches of the same class in another training epoch and are randomly chosen per training epoch. Also, the sequence of classes (with mini-batches of 3-channel RGB images) need not be the same in each training epoch and is randomly chosen.
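For illustration only, the per-epoch scheduling described above may be sketched as follows, building on the hypothetical form_minibatches helper sketched earlier; the names are assumptions:

```python
# Hypothetical sketch: one training epoch ingests all mini-batches of one
# class before moving to the next class; both the mini-batch composition and
# the class order are re-randomized per epoch.
import random

def epoch_schedule(images_by_class: dict, batch_size: int):
    batches = form_minibatches(images_by_class, batch_size)  # re-randomized
    class_order = list(batches.keys())
    random.shuffle(class_order)  # class sequence differs across epochs
    for cls in class_order:
        for mini_batch in batches[cls]:
            yield cls, mini_batch
```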
FIG. 3 shows a high-level block diagram of the multi-class segmentation network 300, in accordance with some embodiments of the present disclosure. As shown in FIG. 3, the multi-class segmentation network 300 includes an encoder network 302, one or more encoder attention module sets 304, a decoder network 306, one or more decoder attention module sets 308, a multi-scale attention network 310, and a logits formation network 312.
FIGS. 4A and 4B show an exemplary architecture diagram of the multi-class segmentation network, in accordance with some embodiments of the present disclosure. The encoder network 302 includes a plurality of encoder blocks. The decoder network 306 includes a plurality of decoder blocks. As shown in FIGS. 4A and 4B, the encoder network 302 includes 5 encoder blocks or stages, namely a first encoder block (encoder block 1), a second encoder block (encoder block 2), a third encoder block (encoder block 3), a fourth encoder block (encoder block 4), and a fifth encoder block (encoder block 5). The first encoder block (encoder block 1) includes a first convolutional layer (Conv1(4,64)) and a second convolutional layer (Conv2(64,64)). The second encoder block (encoder block 2) includes a third convolutional layer (Conv3(64,128)) and a fourth convolutional layer (Conv4(128,128)).
The third encoder block (encoder block 3) includes a fifth convolutional layer (Conv5(128,256)), a sixth convolutional layer (Conv6(256,256)), and a seventh convolutional layer (Conv7(256,256)). The fourth encoder block (encoder block 4) includes an eighth convolutional layer (Conv8(256,512)), a ninth convolutional layer (Conv9(512,512)), and a tenth convolutional layer (Conv10(512,512)). The fifth encoder block (encoder block 5) includes an eleventh convolutional layer (Conv11(512,512)), a twelfth convolutional layer (Conv12(512,512)), and a thirteenth convolutional layer (Conv13(512,512)).
Hence, the encoder network 302 includes 5 encoder blocks with 13 convolutional layers, placed sequentially, as shown in FIGS. 4A and 4B. After each encoder block, a max-pooling layer is used to down-sample the feature maps to half of their size. These features are then further processed for the class-specific patterns.
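For illustration only, the encoder structure may be sketched in PyTorch as follows; the 4-channel input reflects the Conv1(4,64) label (the 3-channel RGB image together with its single-channel ground truth), while the kernel size, padding, and ReLU activations are assumptions not stated in the specification:

```python
import torch.nn as nn

def conv_block(channels):
    """Chain of 3x3 conv + ReLU layers; channel sizes follow the Conv labels
    of FIGS. 4A and 4B (kernel size and padding are assumptions)."""
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class Encoder(nn.Module):
    """Sketch of the 5-block, 13-layer encoder; a max-pool after each block
    halves the spatial size of the feature maps."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            conv_block([4, 64, 64]),           # encoder block 1 (Conv1-Conv2)
            conv_block([64, 128, 128]),        # encoder block 2 (Conv3-Conv4)
            conv_block([128, 256, 256, 256]),  # encoder block 3 (Conv5-Conv7)
            conv_block([256, 512, 512, 512]),  # encoder block 4 (Conv8-Conv10)
            conv_block([512, 512, 512, 512]),  # encoder block 5 (Conv11-Conv13)
        ])
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feats = []  # first set of encoded feature maps, one per block
        for block in self.blocks:
            x = block(x)
            feats.append(x)
            x = self.pool(x)
        return feats, x  # x corresponds to the bottleneck map (f_e6)
```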
The one or more encoder attention module sets 304 are connected to the encoder network 302 through the plurality of encoder blocks. Each encoder attention module set is dedicated to one class of the plurality of classes and hence they are termed class-specific attention modules. Each encoder attention module set includes a plurality of encoder attention modules, wherein the number of encoder attention modules present in each encoder attention module set is equal to the number of the plurality of classes. Also, as each encoder attention module set is dedicated to one class, the number of the one or more encoder attention module sets is equal to the number of the plurality of classes. For example, if there are n classes, then the number of the one or more encoder attention module sets is equal to n.
As shown in FIGS. 4A and 4B, the one or more encoder attention module sets 304 are ({Atten 1-1, Atten 2-1, Atten 3-1, Atten 4-1, Atten 5-1, Atten 6-1}, {Atten 1-2, Atten 2-2, Atten 3-2, Atten 4-2, Atten 5-2, Atten 6-2}, {Atten 1-3, Atten 2-3, Atten 3-3, Atten 4-3, Atten 5-3, Atten 6-3}, {Atten 1-4, Atten 2-4, Atten 3-4, Atten 4-4, Atten 5-4, Atten 6-4}, {Atten 1-5, Atten 2-5, Atten 3-5, Atten 4-5, Atten 5-5, Atten 6-5}, …, {Atten 1-n, Atten 2-n, Atten 3-n, Atten 4-n, Atten 5-n, Atten 6-n}). In an embodiment, the encoder attention module set {Atten 1-1, Atten 2-1, Atten 3-1, Atten 4-1, Atten 5-1, Atten 6-1} is associated with the first class, the encoder attention module set {Atten 1-2, Atten 2-2, Atten 3-2, Atten 4-2, Atten 5-2, Atten 6-2} is associated with the second class, and so on, and the last encoder attention module set {Atten 1-n, Atten 2-n, Atten 3-n, Atten 4-n, Atten 5-n, Atten 6-n} is associated with the n-th class. Out of the one or more encoder attention module sets, one encoder attention module set is dedicated to the normal class of the plurality of classes.
Similarly, as shown in FIGS. 4A and 4B, the decoder network 306 includes 5 decoder blocks or stages namely a first decoder block (decoder block 1), a second decoder block (decoder block 2), a third decoder block (decoder block 3), a fourth decoder block (decoder block 4), and a fifth decoder block (decoder block 5). The first decoder block (decoder block 1) includes a fourteenth convolutional layer (Conv14(512,512)), a fifteenth convolutional layer (Conv15(512,512)), and a sixteenth convolutional layer (Conv16(512,512)). The second decoder block (decoder block 2) includes a seventeenth convolutional layer (Conv17(512,512)), an eighteenth convolutional layer (Conv18(512,512)), and a nineteenth convolutional layer (Conv19(512,256)).
The third decoder block (decoder block 3) includes a twentieth convolutional layer (Conv20(256,256)), a twenty-first convolutional layer (Conv21(256,256)), and a twenty-second convolutional layer (Conv22(256,128)). The fourth decoder block (decoder block 4) includes a twenty-third convolutional layer (Conv23(128,128)) and a twenty-fourth convolutional layer (Conv24(128,64)). The fifth decoder block (decoder block 5) includes a twenty-fifth convolutional layer (Conv25(64,64)) and a twenty-sixth convolutional layer (Conv26(64,64)). An up-sampling layer is present before each decoder block.
Like any segmentation subnetwork, the decoder performs the opposite function of the encoder, aiming to unpack the encoded information of the encoded feature maps into disentangled features. Hence, the decoder network 306 includes 5 decoder blocks with 13 convolutional layers, placed sequentially, as shown in FIGS. 4A and 4B. Before each decoder block, an up-sampling layer is used to up-sample the size of the feature maps. Since both the encoder network 302 and the decoder network 306 have five blocks, and each block represents feature maps of different sizes, intuitively the encoder network 302 and the decoder network 306 also represent the scale-space representation of the common feature maps.
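For illustration only, a corresponding decoder sketch is given below, reusing the hypothetical conv_block helper from the encoder sketch; the bilinear up-sampling mode is an assumption:

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the 5-block, 13-layer decoder; an up-sampling layer precedes
    each block, mirroring the encoder's scale space. Channel sizes follow the
    Conv14-Conv26 labels of FIGS. 4A and 4B."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            conv_block([512, 512, 512, 512]),  # decoder block 1 (Conv14-Conv16)
            conv_block([512, 512, 512, 256]),  # decoder block 2 (Conv17-Conv19)
            conv_block([256, 256, 256, 128]),  # decoder block 3 (Conv20-Conv22)
            conv_block([128, 128, 64]),        # decoder block 4 (Conv23-Conv24)
            conv_block([64, 64, 64]),          # decoder block 5 (Conv25-Conv26)
        ])
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, x):
        feats = []  # first set of decoded feature maps, one per block
        for block in self.blocks:
            x = block(self.up(x))
            feats.append(x)
        return feats
```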
Similarly, the one or more decoder attention module sets 308 are connected to the decoder network 306 through the plurality of decoder blocks. Each decoder attention module set is dedicated to one class of the plurality of classes and hence they are termed class-specific attention modules. Each decoder attention module set includes a plurality of decoder attention modules, wherein the number of decoder attention modules present in each decoder attention module set is equal to the number of the plurality of classes. Also, as each decoder attention module set is dedicated to one class, the number of the one or more decoder attention module sets is equal to the number of the plurality of classes. For example, if there are n classes, then the number of the one or more decoder attention module sets is equal to n.
As shown in FIGS. 4A and 4B, the one or more decoder attention module sets are ({Atten 7-1, Atten 8-1, Atten 9-1, Atten 10-1, Atten 11-1}, {Atten 7-2, Atten 8-2, Atten 9-2, Atten 10-2, Atten 11-2}, {Atten 7-3, Atten 8-3, Atten 9-3, Atten 10-3, Atten 11-3}, {Atten 7-4, Atten 8-4, Atten 9-4, Atten 10-4, Atten 11-4}, {Atten 7-5, Atten 8-5, Atten 9-5, Atten 10-5, Atten 11-5}, …, {Atten 7-n, Atten 8-n, Atten 9-n, Atten 10-n, Atten 11-n}). In an embodiment, the decoder attention module set {Atten 7-1, Atten 8-1, Atten 9-1, Atten 10-1, Atten 11-1} is associated with the first class, the decoder attention module set {Atten 7-2, Atten 8-2, Atten 9-2, Atten 10-2, Atten 11-2} is associated with the second class, and so on, and the last decoder attention module set {Atten 7-n, Atten 8-n, Atten 9-n, Atten 10-n, Atten 11-n} is associated with the n-th class. Out of the one or more decoder attention module sets, one decoder attention module set is dedicated to the normal class of the plurality of classes. Hence, the 5 encoder blocks and the 5 decoder blocks are shared among the plurality of classes.
The multi-scale attention network 310 includes a first multi-scale concatenation layer, a first multi-scale max-pooling layer, a second multi-scale concatenation layer, a second multi-scale max-pooling layer, a third multi-scale concatenation layer, and a third multi-scale max-pooling layer. The multi-scale attention network 310 is connected to the one or more encoder attention module sets 304. More specifically, the attention modules present in each encoder attention module set are connected to the first multi-scale concatenation layer, the second multi-scale concatenation layer, and the third multi-scale concatenation layer, as shown in FIGS. 4A and 4B.
The logits formation network 312 includes a first deconvolutional layer (Dconv1(1,1)), a thirty-eighth convolutional layer (Conv38(2,1)), a second deconvolutional layer (Dconv2(1,1)), a thirty-ninth convolutional layer (Conv39(3,1)), a third deconvolutional layer (Dconv3(1,1)), a fortieth convolutional layer (Conv40(4,1)), a fourth deconvolutional layer (Dconv4(1,1)), a forty-first convolutional layer (Conv41(5,1)), a fifth deconvolutional layer (Dconv5(1,1)), a forty-second convolutional layer (Conv42(6,1)), a logits concatenation layer, a forty-third convolutional layer (Conv43(5,1)), and a soft-max layer. A soft-max activation function is present in the soft-max layer. The decoder attention modules of the one or more decoder attention module sets 308 are connected to the convolutional layers present in the logits formation network 312, as shown in FIGS. 4A and 4B. Also, the concatenation layers present in the multi-scale attention network 310 are connected to the convolutional layers present in the logits formation network 312, as shown in FIGS. 4A and 4B.
The architecture of each encoder attention module (for example, Atten 1-1, Atten 2-1, and so on) present in each encoder attention module set is exactly the same as the architecture of each decoder attention module (for example, Atten 7-1, Atten 7-2, and so on) present in each decoder attention module set, and is henceforth termed simply the attention module. FIG. 5 shows an exemplary architecture diagram of the attention module 500, in accordance with some embodiments of the present disclosure.
Based on the architecture of the attention module shown in FIG. 5, each encoder attention module includes an encoder attention residual unit, an encoder attention logic gate unit, a first encoder attention convolutional layer, an encoder attention soft-max activation layer, and a second encoder attention convolutional layer. The encoder attention residual unit includes a first encoder attention residual unit batch normalization layer, a first encoder attention residual unit ReLU activation layer, a first encoder attention residual unit convolutional layer, a second encoder attention residual unit batch normalization layer, a second encoder attention residual unit ReLU activation layer, a second encoder attention residual unit convolutional layer, and a third encoder attention residual unit convolutional layer.
Similarly, based on the architecture of the attention module shown in FIG. 5, each decoder attention module includes a decoder attention residual unit, a decoder attention logic gate unit, a first decoder attention convolutional layer, a decoder attention soft-max activation layer, and a second decoder attention convolutional layer. The decoder attention residual unit includes a first decoder attention residual unit batch normalization layer, a first decoder attention residual unit ReLU activation layer, a first decoder attention residual unit convolutional layer, a second decoder attention residual unit batch normalization layer, a second decoder attention residual unit ReLU activation layer, a second decoder attention residual unit convolutional layer, and a third decoder attention residual unit convolutional layer.
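For illustration only, one possible realization of the attention module of FIG. 5 is sketched below; only the layer inventory (residual unit, first conv, soft-max, second conv) is taken from the description above, while the exact wiring of the logic gate unit and the kernel sizes are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualUnit(nn.Module):
    """Pre-activation residual unit: BN-ReLU-Conv twice on the main path and
    a third conv on the shortcut (per the three residual-unit convs above)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.bn1, self.bn2 = nn.BatchNorm2d(c_in), nn.BatchNorm2d(c_out)
        self.conv1 = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1)
        self.shortcut = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return out + self.shortcut(x)

class AttentionModule(nn.Module):
    """Sketch of one class-specific attention module: a conv + spatial
    soft-max yields a single attention map shared across all channels (as
    stated below for the attention weights), which gates the residual
    features before a final conv."""
    def __init__(self, channels):
        super().__init__()
        self.residual = ResidualUnit(channels, channels)
        self.attn_conv = nn.Conv2d(channels, 1, 1)        # first attention conv
        self.out_conv = nn.Conv2d(channels, channels, 1)  # second attention conv

    def forward(self, x):
        feat = self.residual(x)
        b, _, h, w = feat.shape
        # soft-max over spatial positions -> one map, broadcast over channels
        weights = F.softmax(self.attn_conv(feat).view(b, 1, -1), dim=-1)
        gated = feat * weights.view(b, 1, h, w)  # logic-gate-style weighting
        return self.out_conv(gated)
```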
Now, the training of the multi-class segmentation network 300 with the one or more 3-channel RGB images present in each mini-batch associated with each class is explained in detail through steps 206a through 206h. At step 206a, each of the one or more 3-channel RGB images present in each mini-batch is passed to the encoder network 302 to extract a first set of encoded feature maps of each 3-channel RGB image present in the mini-batch. In an embodiment, each of the one or more 3-channel RGB images present in each mini-batch is put in a tensor before being passed into the encoder network 302. More specifically, each 3-channel RGB image is passed to the 5 encoder blocks present in the encoder network 302. An encoded feature map is obtained from each encoder block and hence five encoded feature maps are obtained for each 3-channel RGB image. The obtained five encoded feature maps form the first set of encoded feature maps corresponding to the 3-channel RGB image.
At step 206b, the first set of encoded feature maps of each 3-channel RGB image obtained at step 206a is passed through the corresponding encoder attention module set (of the one or more encoder attention module sets 304) connected to the encoder network 302, to obtain a second set of encoded feature maps of each 3-channel RGB image. As each of the encoder attention module sets comprises the class-specific attention modules, only the encoder attention module set (for example, {Atten 1-1, Atten 2-1, Atten 3-1, Atten 4-1, Atten 5-1, Atten 6-1}) corresponding to the class of the 3-channel RGB image is activated, and the encoded feature map (in the first set of encoded feature maps) obtained from each encoder block is passed to the corresponding encoder attention module present in the corresponding encoder attention module set. As each encoder attention module contains attention weights, a weighted encoded feature map is obtained for each encoded feature map in the first set of encoded feature maps. The first set of encoded feature maps is passed to the corresponding attention modules (attention weights) to implement the class-specific self-attention. Thus, the weighted encoded feature maps obtained from the attention modules present in the corresponding encoder attention module set form the second set of encoded feature maps for each 3-channel RGB image.
Thus, the feature maps at various scales of the encoder network 302, encoding the common pattern part, are further passed through the set of attention modules at each scale. The size of the set at each scale is equal to the number of classes of interest (the number of defect classes plus one representing the normal class). These sets of attention modules learn and encode the class-specific pattern at each scale. Further, the attention weights of each encoder attention module set are the same across the channels of the encoded feature maps, for each class. This attention-driven refinement of the common features of the encoder network highlights the details of the defect regions in the feature space.
At step 206c, a first set of decoded feature maps of each 3-channel RGB image is obtained by passing the corresponding first set of encoded feature maps (more specifically, the encoded feature map f_e6 of FIGS. 4A and 4B) obtained at step 206a to the decoder network 306. More specifically, the encoded feature map is up-sampled through the up-sampling layer and passed to the 5 decoder blocks present in the decoder network 306. A decoded feature map is obtained from each decoder block and hence five decoded feature maps are obtained. The obtained five decoded feature maps form the first set of decoded feature maps corresponding to the 3-channel RGB image.
At step 206d, the first set of decoded feature maps of each 3-channel RGB image obtained at step 206c is passed through the corresponding decoder attention module set (of the one or more decoder attention module sets 308) connected to the decoder network 306, to obtain a second set of decoded feature maps of each 3-channel RGB image. As each of the decoder attention module sets comprises the class-specific attention modules, only the decoder attention module set (for example, {Atten 7-1, Atten 8-1, Atten 9-1, Atten 10-1, Atten 11-1}) corresponding to the class of the 3-channel RGB image is activated, and the decoded feature map (in the first set of decoded feature maps) is passed to the corresponding decoder attention module present in the corresponding decoder attention module set. As each decoder attention module contains the attention weights, a weighted decoded feature map is obtained for each decoded feature map in the first set of decoded feature maps. The first set of decoded feature maps is passed to the corresponding attention modules (attention weights) to implement the class-specific self-attention. Thus, the weighted decoded feature maps obtained from the attention modules present in the corresponding decoder attention module set form the second set of decoded feature maps for each 3-channel RGB image. Further, the attention weights of each decoder attention module set are the same across the channels of the decoded feature maps, for each class.
At step 206e, the second set of encoded feature maps of each 3-channel RGB image obtained at step 206b is passed to the multi-scale attention network 310, to obtain a set of scaled feature maps (f_e2, f_e3, f_e4, f_e5 of FIGS. 4A and 4B) of each 3-channel RGB image. The set of scaled feature maps are the refined feature maps of the corresponding 3-channel RGB image. The feature map f_e1 is obtained directly from the first encoder block, as it is the topmost layer in the encoder network 302, without passing into the multi-scale attention network 310.
The multi-scale attention network 310 is a single subnet operating upon the different, class-specific outputs of the attention modules at various encoder network scales. This multi-scale attention network 310 aggregates and encodes the multi-scale defect representation across the classes, up to the previous level in the scale space. This aggregation is able to highlight meaningful features from defects of different sizes (of the corresponding region). Specifically, the class-specific feature maps input from the earlier stages are concatenated among themselves as well as with all the class-specific feature maps from the lower stages, along the channel dimension, and passed forward to further stages. The max-pooling layers in the multi-scale attention network 310 are used to down-sample the concatenated maps to bring their spatial size to the size of the feature maps of the lower stage.
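For illustration only, this concatenate-then-pool aggregation may be sketched as follows; the sketch generalizes the three concatenation/max-pooling stages of the network to an arbitrary number of scales, which is an assumption:

```python
import torch
import torch.nn as nn

class MultiScaleAttention(nn.Module):
    """Sketch: at each stage, the attention-refined encoder map is
    concatenated (along channels) with the max-pooled concatenation carried
    from the finer stages, so coarser stages accumulate defect evidence from
    all scales above them."""
    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool2d(2)  # halves spatial size to match next stage

    def forward(self, attn_feats):
        # attn_feats: attention outputs per encoder stage, finest scale first
        carried = attn_feats[0]
        scaled = [carried]
        for feat in attn_feats[1:]:
            carried = torch.cat([feat, self.pool(carried)], dim=1)
            scaled.append(carried)
        return scaled  # analogues of the scaled feature maps f_e2 ... f_e5
```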
At step 206f, the set of scaled feature maps (f_e1,f_e2,f_e3,f_e4,f_e5 of FIGS. 4A and 4B) of each 3-channel RGB image obtained at step 206e, and the corresponding second set of decoded feature maps obtained at step 206d are passed through the logits formation network 312, to obtain a probability map of each 3-channel RGB image.
First, the set of scaled feature maps of each 3-channel RGB image and the corresponding second set of decoded feature maps are merged through a set of convolutional-deconvolutional layers (the first deconvolutional layer (Dconv1(1,1)), the thirty-eighth convolutional layer (Conv38(2,1)), the second deconvolutional layer (Dconv2(1,1)), the thirty-ninth convolutional layer (Conv39(3,1)), the third deconvolutional layer (Dconv3(1,1)), the fortieth convolutional layer (Conv40(4,1)), the fourth deconvolutional layer (Dconv4(1,1)), the forty-first convolutional layer (Conv41(5,1)), the fifth deconvolutional layer (Dconv5(1,1)), and the forty-second convolutional layer (Conv42(6,1)) of FIGS. 4A and 4B) present in the logits formation network 312, to obtain a set of aggregated feature maps (f_ed1, f_ed2, f_ed3, f_ed4, f_ed5 of FIGS. 4A and 4B) of each 3-channel RGB image. More specifically, the set of scaled feature maps (f_e1, f_e2, f_e3, f_e4, f_e5 of FIGS. 4A and 4B) and the corresponding second set of decoded feature maps are aggregated at all scales individually to obtain the set of aggregated feature maps of each 3-channel RGB image.
In the logits formation network 312, the feature maps are used as logits for prediction of the class of various pixels. Hence, instead of aggregating class-specific features from the decoder network 306 at various scales, the class-specific multi-scale features from the multi-scale attention network 310 and the class-specific disentangled features from the decoder network 306 are first aggregated and then filtered at various scales to decode class-specific features into output logits.
The logits formation network comprises 5 stages. At each i-th stage of the subnet, the feature maps from the i-th stage of the multi-scale attention network 310 are concatenated, along the channel dimension and in a class-specific way, with all of the class-specific disentangled feature maps arising from the application of the attention modules on the i-th stage feature map outputs of the decoder network 306.
Next, the aggregated feature maps present in the set of aggregated feature maps (f_ed1, f_ed2, f_ed3, f_ed4, f_ed5 of FIGS. 4A and 4B) of each 3-channel RGB image are concatenated using the logits concatenation layer (Concatenate of FIGS. 4A and 4B) present in the logits formation network 312, to obtain a concatenated feature map of each 3-channel RGB image.
Lastly, the concatenated feature map of each 3-channel RGB image is passed through the convolutional layer (the forty-third convolutional layer (Conv43(5,1)) of FIGS. 4A and 4B) and a soft-max activation function (softmax layer of FIGS. 4A and 4B) present in the logits formation network 312, to predict the probability map of each 3-channel RGB image.
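A minimal sketch of this final concatenate-convolve-softmax head, assuming the per-scale aggregated maps must first be resized to a common spatial size (the resizing is an assumption; `final_conv` stands in for Conv43):

```python
import torch
import torch.nn.functional as F

def probability_map(aggregated_maps, final_conv):
    """Hypothetical final head: resize the per-scale aggregated maps f_ed_i
    to a common spatial size, concatenate them along the channel dimension,
    apply the final convolution, and take a channel-wise softmax so each
    pixel receives one probability per class."""
    target = aggregated_maps[0].shape[-2:]
    ups = [F.interpolate(f, size=target, mode='bilinear', align_corners=False)
           for f in aggregated_maps]
    logits = final_conv(torch.cat(ups, dim=1))  # (B, n, H, W)
    return torch.softmax(logits, dim=1)         # per-pixel probability map
```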
The output of the network is an n-channel feature map, where n-1 channels correspond to the defect classes and one channel corresponds to the normal/no-defect class. Accordingly, the activation function is softmax. Further, so that there is no individual bias towards a specific defect category (e.g., cracks), the loss of the multi-class segmentation network 300 is calculated and optimized based on the predicted probability map.
At step 206g, a value of a multi-class cross-entropy loss function of the multi-class segmentation network 300 is calculated, for the one or more 3-channel RGB images present in the mini-batch, using the probability map of each 3-channel RGB image obtained (predicted) at step 206f and the corresponding single-channel binary ground-truth image. The value of the multi-class cross-entropy loss function is calculated for all the 3-channel RGB images present in the mini-batch.
The multi-class cross-entropy loss function (CE) of the multi-class segmentation network 300 for the one or more 3-channel RGB images present in the mini-batch is mathematically expressed as equation 1:
CE = -\frac{1}{k \cdot B} \sum_{ch=1}^{k} \sum_{i=1}^{B} y_{i,ch} \, \log\left(p_{i,ch}\right) ---------------------------------- (1)
wherein y indicates the corresponding single-channel binary ground-truth image of each 3-channel RGB image, p indicates the predicted probability map of the 3-channel RGB image obtained at step 206f, B indicates the number of the one or more 3-channel RGB images present in the mini-batch, and k indicates a number of channels of the predicted probability map of the 3-channel RGB image. Further, the number of channels is equal to the number of classes of the plurality of classes.
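A direct transcription of equation (1), assuming p is the predicted probability map of shape (B, k, H, W), y the one-hot ground truth of the same shape, and that the per-pixel terms are averaged over the spatial dimensions (an assumed convention, since equation (1) leaves the pixel index implicit):

```python
import torch

def multi_class_ce(p, y, eps=1e-8):
    """Equation (1): sum over channels ch and batch index i of
    y_{i,ch} * log(p_{i,ch}), scaled by -1/(k*B); eps guards log(0)."""
    B, k = p.shape[0], p.shape[1]
    per_pixel = -(y * torch.log(p + eps)).sum(dim=(0, 1)) / (k * B)
    return per_pixel.mean()  # average the per-pixel terms over H x W
```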
At step 206h, the weights of the multi-class segmentation network 300 are updated through back-propagation, based on the calculated value of the multi-class cross-entropy loss function of the multi-class segmentation network for the one or more 3-channel RGB images present in the mini-batch. As the multi-class segmentation network 300 contains a class-specific attention module set at both the encoder network 302 and the decoder network 306, only the encoder attention module set and the decoder attention module set corresponding to the class of the one or more 3-channel RGB images present in the mini-batch are activated, along with the multi-scale attention network 310 and the logits formation network 312, and only the weights associated with them are updated through the back-propagation. Thus, the remaining encoder attention module sets and the remaining decoder attention module sets are temporarily deactivated. Training of the multi-class segmentation network 300 is optimized using Stochastic Gradient Descent, with a learning rate of 0.0001, a momentum of 0.9, and a weight decay of 0.0002.
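The selective activation and the stated optimizer settings may be sketched as follows, assuming `model` is an instance of the multi-class segmentation network and `encoder_attn_sets` / `decoder_attn_sets` are hypothetical attribute names holding one attention module set per class:

```python
import torch

def set_active_class(model, class_idx):
    """Hypothetical bookkeeping: only the attention module sets belonging
    to the current mini-batch's class receive gradients; the sets of the
    remaining classes are temporarily frozen for this step."""
    for c, (enc_att, dec_att) in enumerate(
            zip(model.encoder_attn_sets, model.decoder_attn_sets)):
        active = (c == class_idx)
        for p in list(enc_att.parameters()) + list(dec_att.parameters()):
            p.requires_grad = active

# Optimizer with the hyperparameters stated in the text.
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001,
                            momentum=0.9, weight_decay=0.0002)
```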
Once the training with the one or more 3-channel RGB images present in one mini-batch is completed and the corresponding network weights are updated, the updated multi-class segmentation network is then trained with the one or more 3-channel RGB images present in the next mini-batch associated with the same class. Once the one or more mini-batches associated with one class are completed, the updated multi-class segmentation network is then trained with the one or more mini-batches associated with the next class. Likewise, once all the mini-batches associated with all the classes of the plurality of classes are completed, the training of the multi-class segmentation network continues with the next training epoch, until the predefined number of training epochs is completed, to obtain the trained multi-class segmentation network. The trained multi-class segmentation network is then validated with the validation data set for fine-tuning the network weights. The fine-tuned multi-class segmentation network is then used for detecting the anomalies present in the product by passing the corresponding 3-channel RGB image.
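A hypothetical outer loop realizing this schedule, reusing the `set_active_class`, `multi_class_ce`, and `optimizer` sketches above; `per_class_minibatches` (a list of mini-batch iterables, one per class) and `num_epochs` are assumed names:

```python
for epoch in range(num_epochs):
    for class_idx, batches in enumerate(per_class_minibatches):
        set_active_class(model, class_idx)   # freeze other classes' sets
        for images, gt in batches:
            optimizer.zero_grad()
            prob = model(images)             # predicted probability maps
            loss = multi_class_ce(prob, gt)  # equation (1)
            loss.backward()                  # back-propagation
            optimizer.step()                 # weight update (step 206h)
```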
At step 208 of the method 200, the one or more hardware processors 104 of the system 100 are configured to receive an input 3-channel RGB image of a product, for which the anomaly or the multiple anomalies are to be detected. At step 210 of the method 200, the one or more hardware processors 104 of the system 100 are configured to pass the input 3-channel RGB image received at step 208, to the trained multi-class segmentation network (after validation) obtained at step 206, to obtain an input probability map of the input 3-channel RGB image. The input probability map of the input 3-channel RGB image comprises a probability value of each class of the plurality of classes for which the trained multi-class segmentation network is obtained. The plurality of classes comprises the one or more anomaly classes and the normal class or the no-defect class. In an embodiment, the size of the input 3-channel RGB image is 256X256 pixels. If an actual size of the image is more than 256X256 pixels, then a patch image or a cropped image of the image portion of interest with 256X256 pixels is created.
Lastly, at step 212 of the method 200, the one or more hardware processors 104 of the system 100 are configured to detect the presence of an anomaly in the product, based on the input probability map of the input 3-channel RGB image. More specifically, based on the probability associated with each class, the one or more anomaly classes or the normal class is detected.
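A minimal inference sketch under the above description (the channel ordering, with channel 0 for the normal class and channels 1 to n-1 for the defect classes, is an assumption):

```python
import torch

def detect_anomaly(model, image):
    """`image` is a (1, 3, 256, 256) tensor; the trained network returns a
    per-pixel probability map, and a per-pixel argmax yields the predicted
    class at each location."""
    model.eval()
    with torch.no_grad():
        prob = model(image)      # (1, n, 256, 256) input probability map
    pred = prob.argmax(dim=1)    # (1, 256, 256) per-pixel class indices
    return bool((pred > 0).any()), pred  # anomaly present?, class map
```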
The methods and systems of the present disclosure firstly attempt to model and learn the specific pattern in various kinds of anomalies in manufacturing images, even when such patterns have high intraclass variation. In the process, class-specific relative attention weights are learned in the representation space, to learn a better representation of each of the patterns, normal or specific defect. The class-specific attention modules at both the encoder network 302 and the decoder network 306 of the multi-class segmentation network 300 are used to discriminate among various kinds of defects and to localize them with robust accuracy, even in case of high intraclass variation of the defects.
The embodiments of the present disclosure herein address the unresolved problem of multi-class anomaly detection in the manufactured product, by using the class-specific attention modules at both the encoder network and the decoder network of the multi-class segmentation network, to discriminate among various kinds of defects and localize them with robust accuracy, even in case of high intraclass variation of the defects. Though the methods and systems of the present disclosure are mainly directed to detecting anomalies in manufactured products, the scope of the invention is not limited thereto and also extends to detecting anomalies in non-manufactured products.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.

Claims:

We Claim:
1. A processor-implemented method (200) comprising the steps of:
receiving, via one or more hardware processors, a plurality of 3-channel RGB images corresponding to each class of a plurality of classes, and a single-channel binary ground-truth image corresponding to each 3-channel RGB image of the plurality of 3-channel RGB images, wherein the plurality of classes comprises one or more anomaly classes and a normal class (202);
forming, via the one or more hardware processors, one or more mini-batches from the plurality of 3-channel RGB images corresponding to each class, based on a predefined mini-batch size, wherein each mini-batch comprises one or more 3-channel RGB images out of the plurality of 3-channel RGB images corresponding to each class (204); and
training, via the one or more hardware processors, a multi-class segmentation network, with the one or more 3-channel RGB images present in each mini-batch associated with each class at a time, until the one or more mini-batches associated with the plurality of classes are completely ingested sequentially for a predefined number of training epochs, wherein the multi-class segmentation network comprises an encoder network, one or more encoder attention module sets, a decoder network, one or more decoder attention module sets, a multi-scale attention network, and a logits formation network (206), and wherein training the multi-class segmentation network with the one or more 3-channel RGB images present in each mini-batch associated with each class comprises:
extracting a first set of encoded feature maps of each 3-channel RGB image present in the mini-batch, by passing the corresponding 3-channel RGB image to the encoder network (206a);
passing the first set of encoded feature maps of each 3-channel RGB image through corresponding encoder attention module set connected to the encoder network, to obtain a second set of encoded feature maps of each 3-channel RGB image (206b);
extracting a first set of decoded feature maps of each 3-channel RGB image present in the mini-batch, by passing the corresponding first set of encoded feature maps to the decoder network (206c);
passing the first set of decoded feature maps of each 3-channel RGB image through corresponding decoder attention module set connected to the decoder network, to obtain a second set of decoded feature maps of each 3-channel RGB image (206d);
passing the second set of encoded feature maps of each 3-channel RGB image to the multi-scale attention network, to obtain a set of scaled feature maps of each 3-channel RGB image (206e);
passing the set of scaled feature maps of each 3-channel RGB image and the corresponding second set of decoded feature maps to the logits formation network, to obtain a probability map of each 3-channel RGB image (206f);
calculating a value of a multi-class cross-entropy loss function of the multi-class segmentation network, for the one or more 3-channel RGB images present in the mini-batch, using the obtained probability map of each 3-channel RGB image and the corresponding single-channel binary ground-truth image (206g); and
updating weights of the multi-class segmentation network, based on the calculated value of the multi-class cross-entropy loss function of the multi-class segmentation network (206h).

2. The method of claim 1, wherein:
(i) the encoder network comprises a plurality of encoder blocks;
(ii) the decoder network comprises a plurality of decoder blocks;
(iii) the one or more encoder attention module sets is connected to the encoder network, wherein each encoder attention module set comprises a plurality of encoder attention modules dedicated to each class of the plurality of classes; and
(iv) the one or more decoder attention module sets is connected to the decoder network, wherein each decoder attention module set comprises a plurality of decoder attention modules dedicated to each class of the plurality of classes.

3. The method of claim 2, wherein:
each encoder attention module comprises an encoder attention residual unit, an encoder attention logic gate unit, a first encoder attention convolutional layer, an encoder attention soft-max activation layer, and a second encoder attention convolutional layer, and wherein the encoder attention residual unit comprises a first encoder attention residual unit batch normalization layer, a first encoder attention residual unit ReLU activation layer, a first encoder attention residual unit convolutional layer, a second encoder attention residual unit batch normalization layer, a second encoder attention residual unit ReLU activation layer, a second encoder attention residual unit convolutional layer, and a third encoder attention residual unit convolutional layer; and
each decoder attention module comprises a decoder attention residual unit, a decoder attention logic gate unit, a first decoder attention convolutional layer, a decoder attention soft-max activation layer, and a second decoder attention convolutional layer, and wherein the decoder attention residual unit comprises a first decoder attention residual unit batch normalization layer, a first decoder attention residual unit ReLU activation layer, a first decoder attention residual unit convolutional layer, a second decoder attention residual unit batch normalization layer, a second decoder attention residual unit ReLU activation layer, a second decoder attention residual unit convolutional layer, and a third decoder attention residual unit convolutional layer.

4. The method of claim 1, wherein a number of each of (i) the one or more encoder attention module sets connected to the encoder network, and (ii) the one or more decoder attention module sets connected to the decoder network, is equal to a number of the plurality of classes.

5. The method of claim 1, wherein during training the multi-class segmentation network with each of the one or more 3-channel RGB images present in each mini-batch associated with each class, (i) the corresponding encoder attention module set, (ii) the corresponding decoder attention module set, (iii) the multiscale attention network, (iv) the logits formation network, (v) the encoder network, and (vi) the decoder network, are activated for back-propagation.

6. The method of claim 1, wherein passing the set of scaled feature maps of each 3-channel RGB image and the corresponding second set of decoded feature maps to the logits formation network, to obtain the probability map of each 3-channel RGB image, comprises:
merging the set of scaled feature maps of each 3-channel RGB image and the corresponding second set of decoded feature maps through a set of convolutional-deconvolutional layers present in the logits formation network, to obtain a set of aggregated feature maps of each 3-channel RGB image;
concatenating the set of aggregated feature maps of each 3-channel RGB image through a logits concatenation layer present in the logits formation network, to obtain a concatenated feature map of each 3-channel RGB image; and
passing the concatenated feature map of each 3-channel RGB image, through a convolutional layer and a soft-max activation function present in the logits formation network, to obtain the probability map of each 3-channel RGB image.

7. The method of claim 1, further comprising:
receiving, via the one or more hardware processors, an input 3-channel RGB image of a product, for which the anomaly is to be detected (208);
passing, via the one or more hardware processors, the input 3-channel RGB image to the trained multi-class segmentation network, to obtain an input probability map (210); and
detecting, via the one or more hardware processors, the presence of anomaly in the product, based on the input probability map (212).

8. A system (100) comprising:
a memory (102) storing instructions;
one or more input/output (I/O) interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more I/O interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive a plurality of 3-channel RGB images corresponding to each class of a plurality of classes, and a single-channel binary ground-truth image corresponding to each 3-channel RGB image of the plurality of 3-channel RGB images, wherein the plurality of classes comprises one or more anomaly classes and a normal class;
form one or more mini-batches from the plurality of 3-channel RGB images corresponding to each class, based on a predefined mini-batch size, wherein each mini-batch comprises one or more 3-channel RGB images out of the plurality of 3-channel RGB images corresponding to each class; and
train a multi-class segmentation network, with the one or more 3-channel RGB images present in each mini-batch associated with each class at a time, until the one or more mini-batches associated with the plurality of classes are completely ingested sequentially for a predefined number of training epochs, wherein the multi-class segmentation network comprises an encoder network, one or more encoder attention module sets, a decoder network, one or more decoder attention module sets, a multi-scale attention network, and a logits formation network, and wherein training the multi-class segmentation network with the one or more 3-channel RGB images present in each mini-batch associated with each class comprises:
extracting a first set of encoded feature maps of each 3-channel RGB image present in the mini-batch, by passing the corresponding 3-channel RGB image to the encoder network;
passing the first set of encoded feature maps of each 3-channel RGB image through corresponding encoder attention module set connected to the encoder network, to obtain a second set of encoded feature maps of each 3-channel RGB image;
extracting a first set of decoded feature maps of each 3-channel RGB image present in the mini-batch, by passing the corresponding first set of encoded feature maps to the decoder network;
passing the first set of decoded feature maps of each 3-channel RGB image through corresponding decoder attention module set connected to the decoder network, to obtain a second set of decoded feature maps of each 3-channel RGB image;
passing the second set of encoded feature maps of each 3-channel RGB image to the multi-scale attention network, to obtain a set of scaled feature maps of each 3-channel RGB image;
passing the set of scaled feature maps of each 3-channel RGB image and the corresponding second set of decoded feature maps to the logits formation network, to obtain a probability map of each 3-channel RGB image;
calculating a value of a loss function of the multi-class segmentation network, for the one or more 3-channel RGB images present in the mini-batch, using the obtained probability map of each 3-channel RGB image and the corresponding single-channel binary ground-truth image; and
updating weights of the multi-class segmentation network, based on the calculated value of the loss function of the multi-class segmentation network.

9. The system of claim 8, wherein:
(i) the encoder network comprises a plurality of encoder blocks;
(ii) the decoder network comprises a plurality of decoder blocks;
(iii) the one or more encoder attention module sets is connected to the encoder network, wherein each encoder attention module set comprises a plurality of encoder attention modules dedicated to each class of the plurality of classes; and
(iv) the one or more decoder attention module sets is connected to the decoder network, wherein each decoder attention module set comprises a plurality of decoder attention modules dedicated to each class of the plurality of classes.

10. The system of claim 9, wherein:
each encoder attention module comprises an encoder attention residual unit, an encoder attention logic gate unit, a first encoder attention convolutional layer, an encoder attention soft-max activation layer, and a second encoder attention convolutional layer, and wherein the encoder attention residual unit comprises a first encoder attention residual unit batch normalization layer, a first encoder attention residual unit ReLU activation layer, a first encoder attention residual unit convolutional layer, a second encoder attention residual unit batch normalization layer, a second encoder attention residual unit ReLU activation layer, a second encoder attention residual unit convolutional layer, and a third encoder attention residual unit convolutional layer; and
each decoder attention module comprises a decoder attention residual unit, a decoder attention logic gate unit, a first decoder attention convolutional layer, a decoder attention soft-max activation layer, and a second decoder attention convolutional layer, and wherein the decoder attention residual unit comprises a first decoder attention residual unit batch normalization layer, a first decoder attention residual unit ReLU activation layer, a first decoder attention residual unit convolutional layer, a second decoder attention residual unit batch normalization layer, a second decoder attention residual unit ReLU activation layer, a second decoder attention residual unit convolutional layer, and a third decoder attention residual unit convolutional layer.

11. The system of claim 8, wherein a number of each of (i) the one or more encoder attention module sets connected to the encoder network, and (ii) the one or more decoder attention module sets connected to the decoder network, is equal to a number of the plurality of classes.

12. The system of claim 8, wherein the one or more hardware processors (104) are configured to activate (i) the corresponding encoder attention module set, (ii) the corresponding decoder attention module set, (iii) the multiscale attention network, (iv) the logits formation network, (v) the encoder network, and (vi) the decoder network, during training the multi-class segmentation network with each of the one or more 3-channel RGB images present in each mini-batch associated with each class, for back-propagation.

13. The system of claim 8, wherein the one or more hardware processors (104) are configured to pass the set of scaled feature maps of each 3-channel RGB image and the corresponding second set of decoded feature maps to the logits formation network, to obtain the probability map of each 3-channel RGB image, by:
merging the set of scaled feature maps of each 3-channel RGB image and the corresponding second set of decoded feature maps through a set of convolutional-deconvolutional layers present in the logits formation network, to obtain a set of aggregated feature maps of each 3-channel RGB image;
concatenating the set of aggregated feature maps of each 3-channel RGB image through a logits concatenation layer present in the logits formation network, to obtain a concatenated feature map of each 3-channel RGB image; and
passing the concatenated feature map of each 3-channel RGB image, through a convolutional layer and a soft-max activation function present in the logits formation network, to obtain the probability map of each 3-channel RGB image.

14. The system of claim 8, wherein the one or more hardware processors (104) are further configured to:
receive an input 3-channel RGB image of a product, for which the anomaly is to be detected;
pass the input 3-channel RGB image to the trained multi-class segmentation network, to obtain an input probability map; and
detect the presence of anomaly in the product, based on the input probability map.

Dated this 14th Day of June 2022
Tata Consultancy Services Limited
By their Agent & Attorney

(Adheesh Nargolkar)
of Khaitan & Co
Reg No IN-PA-1086
