Abstract: This disclosure relates generally to a method and system for co-training a pair of base classifiers for detection of damaged objects. Detecting damaged objects in an infrastructure is difficult due to the limited availability of labeled data, and collating labeled data requires human effort and is time consuming. The proposed disclosure processes unlabeled data, with which the pair of base classifiers is co-trained. The learned pair of base classifiers then determines a true label for the unlabeled data. The determined true labels for the unlabeled data are recorded into the labeled data, thereby iteratively expanding the labeled data with the supplementary unlabeled data. The proposed disclosure is capable of detecting damaged objects with a higher accuracy rate utilizing the pair of base classifiers, thereby reducing time and expanding the labeled data, which can be further utilized for detection of new objects in real time.
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR CO-TRAINING A PAIR OF BASE CLASSIFIERS FOR DETECTING DAMAGED OBJECTS
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
[001] The present application claims priority from Indian patent application no. 201821008840, filed on 09th March, 2018, the complete disclosure of which, in its entirety, is herein incorporated by reference.
TECHNICAL FIELD
[002] The disclosure herein generally relates to training of classifiers and, more particularly, to a method and system for co-training a pair of base classifiers for detecting damaged objects.
BACKGROUND
[003] Numerous tragic natural disasters threaten vulnerable areas of the world, causing extensive destruction to infrastructure, flattening buildings, and dramatically changing the land surface. After a disaster, knowing the location and extent of damaged buildings over a large affected area is vital for emergency response actions and rescue work. In such cases, especially where the internal state of an object cannot be visibly accessed, it is difficult to detect the objects that are vulnerably damaged in order to provide repair or replacement. In such scenarios, establishing a system for detecting damaged objects from a surveillance video is an emergent solution for providing a reliable system that broadly facilitates such assessment.
[004] Most conventional methods for detection of damaged objects using semi-supervised learning require sufficient training data for training a classifier that performs damaged-object classification. The training data required for training the classifier is collected manually, where human effort is required for annotation, which increases the time and cost of achieving accurate predictions. However, these conventional methods are limited in providing the training data required for training the classifier necessary for performing damaged-object detection, resulting in lower accuracy.
[005] Thus, owing to the limited availability of labeled data, generation of labeled data for training the classifier is required for detecting damaged objects.
SUMMARY
[006] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a system for co-training a pair of base classifiers for detection of damaged objects is provided. The system includes a processor, an Input/Output (I/O) interface, and a memory coupled to the processor, wherein the processor is capable of executing programmed instructions stored in the memory to pre-process a plurality of image frames of a plurality of video streams to obtain training data. The preprocessing comprises labeling a set of image frames from the plurality of image frames to generate labeled data and tagging remaining image frames from the plurality of image frames as unlabeled data. Here, the labeling comprises associating a pixel mask depicting a region of interest for a damaged object of each image frame from the set of image frames. Further, the system co-trains a first base classifier and a second base classifier using the labeled data associated with the training data marked with the region of interest. Further, the system iteratively re-trains the first base classifier and the second base classifier using the unlabeled data, wherein the unlabeled data is identified with a true label for each iteration. Further, the labeled data is expanded from the supplementary unlabeled data and its true label for detecting new instances as damaged objects or undamaged objects.
[007] In another aspect, a method for co-training a pair of base classifiers for detection of damaged objects is provided. The method includes pre-processing, via a processor, a plurality of image frames of a plurality of video streams to obtain training data. The preprocessing comprises labeling a set of image frames from the plurality of image frames to generate labeled data and tagging remaining image frames from the plurality of image frames as unlabeled data. Here, the labeling comprises associating a pixel mask depicting a region of interest for a damaged object of each image frame from the set of image frames. Further, the method co-trains a first base classifier and a second base classifier using the labeled data associated with the training data marked with the region of interest. Further, the method iteratively re-trains the first base classifier and the second base classifier using the unlabeled data, wherein the unlabeled data is identified with a true label for each iteration. Further, the labeled data is expanded from the supplementary unlabeled data and its true label for detecting new instances as damaged objects or undamaged objects.
[008] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[009] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[010] FIG.1 illustrates an overview of an environment for co-training a pair of base classifiers for detecting damaged objects, in accordance with an example embodiment of the present disclosure.
[011] FIG. 2 illustrates an exemplary block diagram of the system, of the environment of FIG. 1, used for co-training the pair of base classifiers for detecting damaged objects, in accordance with another embodiment of the present disclosure.
[012] FIG.3 is an exemplary architecture of the system of FIG. 2 depicting co-training of the pair of base classifiers for true labeling of unlabeled data used during training the pair of base classifiers of the system of FIG. 2, in accordance with an embodiment of the present disclosure.
[013] FIG.4 illustrates an exemplary flow diagram of a method for co-training the pair of base classifiers for detecting damaged objects using the system of FIG. 1, in accordance with an embodiment of the present disclosure.
[014] FIGS. 5A and 5B illustrate a representation of a sequence of image frames processed using the first feature extractor and the second feature extractor, in accordance with an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[015] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
[016] With the current state of the art, it is difficult to detect damaged objects in an infrastructure using supervised learning techniques due to the limited amount of labeled data. Here, labeled data refers to a faulty-object dataset that provides a sample set for training a classifier to perform detection. The challenge pertains to the problem of detecting damaged objects from a plurality of video streams in real time. Generation of labeled data for training the classifier is difficult due to the availability of only limited labeled data; however, for training the classifier, the system requires a large set of annotated training data that enables the trained classifier to perform damaged-object detection.
[017] Training a classifier using imbalanced datasets has been a longstanding problem, since real datasets rarely have a balanced ratio of training samples for each class in binary classification. This problem occurs due to the inherent imbalance that arises when the prior probabilities of the various classes differ significantly. In such cases, the majority of samples belong to one class, typically the negative instances in the case of binary classification, and far fewer to the other class. Accordingly, the posterior probabilities of detected objects also differ. Standard training algorithms, when trained over imbalanced datasets, often get biased towards the majority class, such as the negative instances. This leads to a higher misclassification rate for the minority-class samples, such as the positive instances.
[018] The embodiments herein provide a method and system for co-training a pair of base classifiers for detecting damaged objects. The pair of base classifiers is co-trained iteratively with different views of a feature space representation of unlabeled data obtained from a plurality of image frames associated with a plurality of video streams. The pair of co-trained base classifiers may then rank the unlabeled data for obtaining a true label based on the higher confidence, in an iterative manner. The present disclosure enables collection of large volumes of unlabeled data and identification of true labels for the unlabeled data using the pair of base classifiers. The system is capable of acquiring new unlabeled data in real time, and the pair of base classifiers is co-trained iteratively with the new instances, which are assigned true labels, for collating more training data for detection of damaged objects. The dynamic updating of the training data generates a large volume of labeled data, which improves the classification accuracy and provides more accurate identification of damaged objects.
[019] Referring now to the drawings, and more particularly to FIG. 1 through FIG.4, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[020] FIG.1 illustrates an overview of an environment for co-training a pair of base classifiers for detecting damaged objects, in accordance with an example embodiment of the present disclosure. The base classifier co-training system 102, alternatively referred as system 102, is configured to receive a plurality of video streams from one or more external sources 106. The system 102 may be embodied in a computing device, for instance a computing device 104. Although the present disclosure is explained considering that the system 102 is implemented on a server, it may be understood that the system 102 may also be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a cloud-based computing environment and the like. In one implementation, the system 102 may be implemented in a cloud-based environment. It will be understood that the system 102 may be accessed by multiple users through one or more user devices or applications residing on the user devices.
[021] In an embodiment, a network 108, which transmits a plurality of image frames from the data source 106 to the computing device 104, may be a wireless or a wired network, or a combination thereof. In an example, the network 108 can be implemented as a computer network, as one of the different types of networks, such as virtual private network (VPN), intranet, local area network (LAN), wide area network (WAN), the internet, and such. The network 108 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), and Wireless Application Protocol (WAP), to communicate with each other. Further, the network 108 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices. The network devices within the network 108 may interact with the system 102 through communication links. In an embodiment, the computing device 104, which implements the system 102 can be a workstation, a mainframe computer, a general purpose server, a network server or the like. The components and functionalities of the system 102 are described further in detail with reference to FIG. 2 and FIG.3.
[022] FIG. 2 illustrates an exemplary block diagram of the system of the environment of FIG. 1, used for co-training the pair of base classifiers for detecting damaged objects, in accordance with another embodiment of the present disclosure. In an embodiment, the system 102 includes processor(s) 204, communication interface device(s), alternatively referred to as input/output (I/O) interface(s) 206, and one or more data storage devices or memory 208 operatively coupled to the processor(s) 204. The processor(s), alternatively referred to as the one or more processors 204, may be one or more software processing modules and/or hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 102 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud, and the like.
[023] The I/O interface(s) 206 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like, and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server. The I/O interface 206, through the ports, is configured to receive the plurality of video streams from the data source 106.
[024] The memory 208 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 208 may store the pair of base classifiers. Further, the memory 208 may include a repository 210 for storing the training data for the pair of base classifiers. In an embodiment, the data source 106 may be external (not shown) to the system 102 and accessed through the I/O interfaces 206. The memory 208 may further comprise information pertaining to input(s)/output(s) of each step performed by the system 102 and the methods of the present disclosure.
[025] FIG. 3 is an exemplary architecture performing co-training of the pair of base classifiers for true labeling of the unlabeled data of FIG. 2, in accordance with an embodiment of the present disclosure. The system 102 includes a region proposal component, a feature extraction component, a bilateral transformation component, and a co-training component. Initially, the system 102 receives a plurality of video streams captured in sequence using a motion-capturing device. The data source 106 of the system 102 records a plurality of image frames associated with the plurality of video streams. The region proposal component generates multiple regions of interest for each image of the unlabeled data to obtain a set of possible objects. The proposed regions of interest from the unlabeled data are fed into the feature extraction component for obtaining feature sets. The feature extraction component comprises a first feature extractor and a second feature extractor. The first feature extractor comprises a fast region based convolutional neural network for extracting a first set of features from each region proposal, for analysis by the first base classifier. The second feature extractor comprises a spatial pyramid pooling network for extracting a second set of features from each region proposal, for analysis by the second base classifier. Each feature extractor is coupled to the bilateral transformation component for further processing. The bilateral transformation component comprises a first bilateral transformation and a second bilateral transformation. The first bilateral transformation transforms the first set of features into vectors for input to the first base classifier of the pair of base classifiers. The second bilateral transformation transforms the second set of features into vectors for input to the second base classifier of the pair of base classifiers. The co-training component comprises the pair of base classifiers, namely the first base classifier and the second base classifier. Each of the base classifiers determines the true label for the unlabeled data. Further, the pair of base classifiers is co-trained with new labeled instances, post assignment of true labels, iteratively, thereby enabling the system to detect damaged objects in real time. For performing co-training with the pair of base classifiers, the system initially collects training data, that is, the labeled data. The labeled data, or the set of image frames, is manually labeled with the damaged objects in each image frame. The labeled data is further utilized individually for training the pair of base classifiers to classify input image frames into damaged objects or undamaged objects. Here, the labeled data is associated with a pixel mask depicting a region of interest corresponding to the damaged object.
[026] Further, the system utilizes the co-trained pair of base classifiers for processing the unlabeled data and obtains the true label by detecting new input image frames as damaged objects or undamaged objects. The unlabeled data is processed to generate region proposals, which are then simultaneously passed through the two feature extractor components, namely the first feature extractor and the second feature extractor. Each of the feature extractor components is pre-trained using a weighted triplet loss function approach to extract two sets of features captured with two different dimensionalities. The first set of features extracted by the first feature extractor is transformed into a first set of feature space representations using the first bilateral transformation component. The second set of features extracted by the second feature extractor is transformed into a second set of feature space representations using the second bilateral transformation component. Further, the first base classifier obtains the first set of feature space representations as its input, and the second base classifier obtains the second set of feature space representations as its input, for predicting a true label of each element of the set of regions of interest of the unlabeled data. The true label is predicted for the regions of interest of the unlabeled data based on the higher confidence value obtained from the first base classifier and the second base classifier. Further, the predicted true label and associated regions are used for co-training the first base classifier and the second base classifier. The labeled dataset is thereafter expanded from the supplementary unlabeled data and its true label.
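By way of a non-limiting illustration, the selection of the true label from whichever base classifier is more confident may be sketched as follows; the scikit-learn classifiers, feature dimensionalities and randomly generated feature space representations are illustrative placeholders and not the specific networks of the disclosure.

```python
# Illustrative sketch: predict the true label of a region proposal from the view
# whose base classifier reports the higher confidence (placeholder classifiers
# and synthetic feature space representations only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for the first and second base classifiers, each trained on its own view.
clf_view1 = LogisticRegression().fit(rng.normal(size=(40, 8)), rng.integers(0, 2, 40))
clf_view2 = LogisticRegression().fit(rng.normal(size=(40, 6)), rng.integers(0, 2, 40))

def predict_true_label(repr_view1, repr_view2):
    """Return (label, confidence), taking the prediction of the more confident view."""
    p1 = clf_view1.predict_proba(repr_view1.reshape(1, -1))[0]
    p2 = clf_view2.predict_proba(repr_view2.reshape(1, -1))[0]
    best = p1 if p1.max() >= p2.max() else p2
    return int(best.argmax()), float(best.max())   # e.g. 1 = damaged, 0 = undamaged

label, confidence = predict_true_label(rng.normal(size=8), rng.normal(size=6))
```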
[027] FIG. 4 illustrates an exemplary flow diagram of a method for training the pair of base classifiers for detecting damaged objects using the system of FIG. 1, in accordance with an embodiment of the present disclosure, and FIGS. 5A and 5B illustrate a representation of a sequence of image frames processed using the first feature extractor and the second feature extractor, in accordance with an embodiment of the present disclosure. In an embodiment, the system 102 comprises one or more data storage devices or the memory 208 operatively coupled to the one or more processors 204 and is configured to store instructions for execution of steps of the method 400 by the one or more processors (alternatively referred to as processor(s)) 204 in conjunction with various components of the system 102. The steps of the method 400 of the present disclosure will now be explained with reference to the components or blocks of the system 102 as depicted in FIG. 1 and FIG. 2 and the steps of the flow diagram as depicted in FIG. 4. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
[028] Referring to the steps of the method 400, at step 402 of the method 400, the processor 204 is configured to pre-process the plurality of image frames from the plurality of video streams to obtain training data, wherein the preprocessing comprises labeling a set of image frames from the plurality of image frames to generate labeled data and tagging remaining image frames from the plurality of image frames as unlabeled data. The labeling comprises associating a pixel mask depicting a region of interest for an object of each image frame from the set of image frames. In the initial training data, only a few image frames from the plurality of image frames are labeled manually. In the initial iterations, both the image and the pixel mask depicting the label of a region of interest are fed to the algorithm pipeline.
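A minimal sketch of this pre-processing step is shown below, assuming a simple in-memory representation; the field names and the mapping of frame indices to hand-drawn masks are hypothetical and only illustrate the labeled/unlabeled split of step 402.

```python
# Illustrative sketch of step 402: a few frames carry a manually drawn pixel mask
# of the damaged-object region of interest (labeled data); the rest are tagged as
# unlabeled. Field names are hypothetical.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class TrainingSample:
    frame: np.ndarray                  # H x W image frame
    mask: Optional[np.ndarray] = None  # H x W binary pixel mask of the damaged ROI
    label: Optional[int] = None        # 1 = damaged, 0 = undamaged, None = unlabeled

def preprocess(frames, manual_masks):
    """Split frames into labeled data (with pixel masks) and unlabeled data."""
    labeled, unlabeled = [], []
    for i, frame in enumerate(frames):
        if i in manual_masks:
            labeled.append(TrainingSample(frame, manual_masks[i], label=1))
        else:
            unlabeled.append(TrainingSample(frame))
    return labeled, unlabeled

frames = [np.zeros((64, 64)) for _ in range(5)]
labeled, unlabeled = preprocess(frames, {0: np.ones((64, 64), dtype=np.uint8)})
```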
[029] At step 404 of the method 400, the processor 204 is configured to individually train the first base classifier and the second base classifier using the labeled data associated with the training data manually marked with the region of interest.
[030] At step 406 of the method 400, the processor 204 is configured to iteratively retrain the first base classifier and the second base classifier using the unlabeled data, wherein the unlabeled data is identified with a true label for each iteration. To re-train the pair of base classifiers with the unlabeled data, the method identifies a set of regions from the unlabeled data as possible objects of interest. The possible objects of interest are localized using a region proposal technique which is known in the art. For every sequential iteration, the pair of base classifiers is co-trained with the untagged data, that is, the unlabeled data. The first base classifier and the second base classifier assign a true label to the unlabeled data based on the higher confidence score. For the unlabeled data, the set of possible regions is proposed for each image frame for region-level labeling, and the set of regions of interest for each image frame is then generated. Further, the proposed regions of interest are fed to one view, associated with one base classifier, for deriving the confidence measure. The region proposals with higher confidence values are then transferred to the other view as true labels, expanding the labeled data associated with the other view. The base classifier associated with the other view is further retrained using its expanded labeled training data. In the next iteration, it is the other base classifier of the pair, associated with the other view, which labels each region of interest from the next unlabeled frame and transfers confidently labeled proposals to the first view, as additional labeled data, for re-training of the first base classifier.
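The alternating view-to-view transfer described above may be sketched, purely for illustration, as the following loop; the logistic-regression learners, the two synthetic views, the batch size of transferred proposals, and the confidence ranking are assumptions standing in for the actual components of the disclosure.

```python
# Illustrative sketch of the alternating co-training loop of step 406: in each
# iteration one view labels its most confident unlabeled proposals and hands them
# to the other view as additional labeled data (synthetic data, placeholder models).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
view1 = rng.normal(size=(n, 8))          # first feature space representation
view2 = rng.normal(size=(n, 6))          # second feature space representation
truth = (view1[:, 0] + view2[:, 0] > 0).astype(int)

labels = {i: truth[i] for i in range(20)}   # a few manually labeled region proposals
unlabeled = set(range(20, n))

clf = [LogisticRegression(), LogisticRegression()]
views = [view1, view2]

for it in range(10):
    a, b = it % 2, (it + 1) % 2              # view 'a' labels, view 'b' receives
    idx = sorted(labels)
    clf[a].fit(views[a][idx], [labels[i] for i in idx])
    if not unlabeled:
        break
    cand = sorted(unlabeled)
    proba = clf[a].predict_proba(views[a][cand])
    conf = proba.max(axis=1)
    # transfer the most confident proposals to the other view as true labels
    for j in np.argsort(conf)[-5:]:
        i = cand[j]
        labels[i] = int(proba[j].argmax())
        unlabeled.discard(i)
    clf[b].fit(views[b][sorted(labels)], [labels[i] for i in sorted(labels)])
```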
[031] Further, the method extracts, from the unlabeled data, the first set of features using the first feature extractor and the second set of features using the second feature extractor, wherein the first feature extractor and the second feature extractor are pre-trained using the weighted triplet loss function. The first feature extractor, comprising the fast region based convolutional neural network, receives the set of regions identified as possible objects of interest in the unlabeled data using the region proposal technique, for extracting the first set of features. The second feature extractor, comprising the spatial pyramid pooling network, receives the set of regions identified as possible objects of interest in the unlabeled data using the region proposal technique, for extracting the second set of features. In an embodiment, pre-training the first feature extractor and the second feature extractor using the weighted triplet loss function is performed over multiple steps. The steps comprise processing the unlabeled data for every previous iteration, based on the prediction confidence score obtained from the pair of base classifiers. The corresponding positive image region of interest instances are shortlisted from the set of region proposals of the unlabeled data. The corresponding negative image region of interest instances are shortlisted from the set of region proposals of the unlabeled data.
[032] Further, the newly determined region of interest instances from the set of region proposals are extracted from the unlabeled data, which then extend the existing data set of the corresponding known positive image region of interest instances and negative image region of interest instances. The pair of base classifiers are binary classifiers. The positive class for a specific object comprises the undamaged instances of the same object class, while the negative class comprises the damaged instances, with varying degrees of damage, of the same object class.
[033] The positive image region of interest instances for an object are the undamaged instances of the region of interest, which are readily available. The negative image region of interest instances for the same object are the limited damaged-object data available for the same image region of interest. As the co-training progresses, more instances of damaged objects are gradually captured as region proposals and classified with confidence by at least one of the two base classifiers. It is implied here that any background regions from images which capture the objects have to be purged after generation of region proposals.
[034] Further, the first factor of the weighted triplet loss function and the second factor of the weighted triplet loss function are obtained for computing the updated value of the weighted triplet loss function for each of the feature extractor components. The first factor of the weighted triplet loss function is determined as the margin between two loss sub-factors for the positive image region of interest instances. The minuend sub-factor is the sum of the cumulative squared (L2) distance between the updated set of positive class instances and the prediction of the current region of interest being considered for co-training, and a very small positive real number; the subtrahend sub-factor is the cumulative squared (L2) distance between the updated set of negative class instances and the prediction of the current region of interest being considered for co-training. The first factor is taken as zero if the margin between the two sub-factors is negative. The second factor of the weighted triplet loss function is determined as the absolute value of the cumulative squared (L2) distance between the expanded set of positive image region of interest instances and the negative image region of interest instances. The weighted triplet loss function is computed by multiplying the first factor of the weighted triplet loss function and the second factor of the weighted triplet loss function. The updated weighted triplet loss function value for each of the feature extractor components is then used to fine-tune the first feature extractor and the second feature extractor before providing the input data to the first feature extractor and the second feature extractor for the current region of interest.
[035] In one embodiment, the modified loss function is based on the triplet loss, and this loss is calculated by dynamically weighting the triplet loss function. The modified loss function has been specifically designed for structural damage as an anomaly, considering the negative image regions of interest of different instances. To calculate the triplet loss, the method obtains unlabeled data represented as instances of the undamaged class as well as the damaged class. Further, the corresponding positive as well as negative image regions of interest of the same instances as the unlabeled data are obtained. The triplet loss is then obtained for the different views as represented below in equation (1), where for a distance d on the embedding space, the loss of a triplet (u, p, n) is,
L = max(d(u,p) - d(u,n) + margin, 0) ------------ (1)
Minimizing equation (1) pushes d(u,p) towards 0 and forces d(u,n) to be greater than d(u,p) + margin. The triplet loss is advantageous for image regions of interest with high variance, considering the non-uniform degree of damage across instances of a specific object, as represented in the accompanying figures.
[036] During iterative co-training of the pair of base classifiers, more valid triplets arise as the individual base classifiers assign labels based on the higher-confidence classification. The outcome of the predictions by both base classifiers is used to rank the negative image regions of interest by degree of damage. The multiplicative loss factor of the triplet loss signifies the degree of damage: within every triplet, the distance between the positive and negative instances signifies the degree of damage. To proportionally emphasize the degree of damage, the computed weighted triplet loss function values are updated dynamically as more triplets are generated, as represented in equation (2),
L={abs(d(p,n))}•{max(d(u,p)-d(u,n)+margin,0)} --------------- (2)
The weighted triplet loss function thus calculated is then used to fine-tune the first feature extractor and the second feature extractor components.
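For clarity, equations (1) and (2) may be expressed, under the assumption that d is the squared L2 distance on the embedding space and that u, p and n are embeddings of the unlabeled, positive (undamaged) and negative (damaged) regions of interest, as the following sketch; the margin and the embedding values are illustrative.

```python
# Illustrative sketch of the triplet loss of equation (1) and the weighted triplet
# loss of equation (2); d is taken as the squared L2 distance, and u, p, n are
# embeddings of the unlabeled, positive (undamaged) and negative (damaged) instances.
import numpy as np

def d(x, y):
    """Squared L2 distance on the embedding space."""
    return float(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def triplet_loss(u, p, n, margin=1.0):
    """Equation (1): L = max(d(u,p) - d(u,n) + margin, 0)."""
    return max(d(u, p) - d(u, n) + margin, 0.0)

def weighted_triplet_loss(u, p, n, margin=1.0):
    """Equation (2): abs(d(p,n)) weights equation (1) by the degree of damage."""
    return abs(d(p, n)) * triplet_loss(u, p, n, margin)

u = np.array([0.8, 1.0])   # unlabeled region-of-interest embedding
p = np.array([0.0, 0.2])   # undamaged (positive) instance
n = np.array([1.0, 1.2])   # damaged (negative) instance
print(triplet_loss(u, p, n))            # approx. 2.2
print(weighted_triplet_loss(u, p, n))   # approx. 4.4: larger d(p,n) emphasizes the damage
```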
[037] The pair of base classifiers is co-trained over multiple iterations, and the following sequential steps are performed in each iteration on the same set of the plurality of image frames, with each of the pair of base classifiers acting as a learner that uses an ideally independent set of features as a separate view. Further, the output of the bilateral transformation component, which generates two views of the unlabeled data, is fed to the co-training component. During the co-training process, the first base classifier and the second base classifier iteratively label a configurable number of unlabeled image regions of interest that show the highest confidence value from their point of view. The confidence value is a relative confidence, where one classifier has a certain high confidence in the example image region of interest while the other classifier has a certain low confidence in the same image region of interest. The first set of feature space representations from the first bilateral transformation component and the second set of feature space representations from the second bilateral transformation component corresponding to these newly labeled example images are then added to the labeled training data for the learner of the other classifier. Each base classifier learner is then re-trained using the expanded training data set. In the next iteration, the roles alternate, and the other classifier labels a few highly confident image regions of interest from the remaining set of unlabeled example image regions of interest. The bilateral transformation output corresponding to these newly labeled example image regions of interest is then added to the labeled training data of the first, or previous iteration's, classifier learner, which is then re-trained with the expanded training data. Further, the method transforms the first set of features into the first set of feature space representations using the first bilateral transformation and the second set of features into the second set of feature space representations using the second bilateral transformation. The first bilateral transformation component output is obtained from the first feature extractor component by appending two or more dimensions to each tensor element of the input feature tensor. One dimension is the current time, which is the same for all tensor elements. The second, spatial, dimension is augmented by performing the analysis of the corresponding deep neural network back to the input layer. All the pixels that are reachable from a given neuron or tensor output intuitively form a region. Such a region between two successive layers is called the receptive field, and is a spatial window centered on a pixel. Accumulation of receptive fields across multiple layers still retains a window-like or region-like configuration. By the smoothness constraint on the distance between two image frames, the centroid of such a region also moves smoothly. Hence, the centroid of all pixel positions reachable from a specific neuron in the feature output layer is taken, and that value is appended as two independent dimensions to the feature value of that tensor element.
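A minimal sketch of the bilateral transformation described above is given below; the constant stride and offset used to approximate each cell's receptive-field centroid, and the tensor shapes, are assumptions for illustration only.

```python
# Illustrative sketch of the bilateral transformation: append the current time and
# the centroid (row, col) of the input pixels reachable from each feature-map cell
# (its receptive field) to that cell's feature vector. The stride/offset values
# approximating the centroid are illustrative assumptions.
import numpy as np

def bilateral_transform(feature_map, t, stride=16, offset=8):
    """feature_map: (H, W, C) tensor from a feature extractor; returns (H, W, C + 3)."""
    h, w, _ = feature_map.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    centroid_r = rows * stride + offset      # approximate receptive-field centroid row
    centroid_c = cols * stride + offset      # approximate receptive-field centroid column
    time_plane = np.full((h, w), float(t))   # current time, same for all tensor elements
    extra = np.stack([time_plane, centroid_r, centroid_c], axis=-1)
    return np.concatenate([feature_map, extra.astype(feature_map.dtype)], axis=-1)

features = np.random.rand(4, 4, 256).astype(np.float32)
augmented = bilateral_transform(features, t=7)   # shape (4, 4, 259)
```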
In one embodiment, a true label is associated with each element of the set of regions of interest within the unlabeled data using the first set of feature space representations obtained from the first bilateral transformation component and the second set of feature space representations obtained from the second bilateral transformation component, wherein the predicted true label and associated regions are used for co-training the first base classifier and the second base classifier.
[038] At step 408 of the method 400, the processor 204 is configured to expand the labeled data from the supplementary unlabeled data and its true label.
[039] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[040] The embodiments of the present disclosure herein address the unresolved problem of detecting damaged objects with limited availability of labeled data. The method performs specialized semi-supervised learning based classification for detecting damaged objects or undamaged objects with limited availability of damaged object instances in the labeled and the unlabeled data. The pair of base classifiers is capable of acquiring new training instances for performing detection of damaged objects. The embodiment is able to robustly classify damaged objects, even in the presence of intra-class variability due to unknown degrees of structural damage to objects.
[041] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[042] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[043] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[044] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[045] It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.
CLAIMS:
1. A processor implemented method for co-training a pair of base classifiers, wherein the method comprises:
pre-processing, by the processor (204), a plurality of image frames of a plurality of video streams to obtain training data, wherein the preprocessing comprises labeling a set of image frames from the plurality of image frames to generate labeled data and tagging remaining image frames from the plurality of image frames as unlabeled data, wherein the labeling comprises associating a pixel mask depicting a region of interest for a damaged object of each image frame from the set of image frames;
co-training, the first base classifier and the second base classifier using the labeled data associated with the training data marked with region of interest;
iteratively re-training, the first base classifier and the second base classifier using the unlabeled data, wherein the unlabeled data is identified with a true label for each iteration, wherein identifying the true label and retraining comprises,
identifying, from the unlabeled data a set of regions as possible objects of interest using a region proposal technique;
extracting, from the unlabeled data, a first set of features using a first feature extractor and a second set of features using a second feature extractor, wherein the first feature extractor and the second feature extractor are pre-trained using a weighted triplet loss function;
transforming, the first set of features into a first set of feature space representations using a first bilateral transformation and a second set of feature space representations using a second bilateral transformation;
associating, using the first set of feature space representations obtained from the first base classifier and the second set of feature space representations obtained from the second base classifier, a predicted true label for each element of the set of regions of interest within the unlabeled data, wherein the predicted true label and associated regions are used for co-training the first base classifier and the second base classifier; and
expanding, the labeled data from the supplementary unlabeled data and its true label.
2. The method as claimed in claim 1, wherein predicting the true label for the region of interest for the unlabeled data is based on the higher confidence value obtained from the first base classifier and the second base classifier.
3. The method as claimed in claim 1, wherein the first set of features are extracted from the first feature extractor comprising a fast region based convolutional neural network and the second set of features are extracted from the second feature extractor comprising a spatial pyramid pooling network.
4. The method as claimed in claim 1, wherein pre-training the first feature extractor and the second feature extractor using the weighted triplet loss function comprises:
obtaining, for every previous iteration based on the prediction from the pair of base classifiers, corresponding positive image regions of interest instances and corresponding negative image regions of interest instances from the set of region proposals of the unlabeled data;
expanding, with the newly determined region of interest instances from the set of region proposals from the unlabeled data, into the existing data set of the corresponding known positive image region of interest instances and the negative image region of interest instances;
obtaining,
a first factor of the weighted triplet loss function, by determining margins between the two loss sub-factors for the positive image region of interest instance; and
a second factor of the weighted triplet loss function, by determining an absolute value of the cumulative squared distance (L2) between the expanded set of the positive image region of interest instance and the negative image region of interest instance;
computing, an updated value for the weighted triplet loss function, by multiplying the first factor of the weighted triplet loss function and the second factor of the weighted triplet loss function; and
fine-tuning, using the updated weighted triplet loss function value, the first feature extractor and the second feature extractor before providing the input data to the first feature extractor and the second feature extractor for the current region of interest in the current iteration.
5. A system (102) for co-training a pair of base classifiers, the system (102) comprising:
a memory (208) storing instructions;
one or more Input/Output (I/O) interfaces (206);
and one or more processors (204) coupled to the memory (208) via the one or more I/O interfaces (206), wherein the processor (204) is configured by the instructions to:
pre-process a plurality of image frames of a plurality of video streams to obtain training data, wherein the preprocessing comprises labeling a set of image frames from the plurality of image frames to generate labeled data and tagging remaining image frames from the plurality of image frames as unlabeled data, wherein the labeling comprises associating a pixel mask depicting a region of interest for a damaged object of each image frame from the set of image frames;
co-train, the first base classifier and the second base classifier using the labeled data associated with the training data marked with region of interest;
iteratively re-train, the first base classifier and the second base classifier using the unlabeled data, wherein the unlabeled data is identified with a true label for each iteration, wherein identifying the true label and retraining comprises,
identifying, from the unlabeled data a set of regions as possible objects of interest using a region proposal technique;
extracting, from the unlabeled data, a first set of features using a first feature extractor and a second set of features using a second feature extractor, wherein the first feature extractor and the second feature extractor are pre-trained using a weighted triplet loss function;
transforming, the first set of features into a first set of feature space representations using a first bilateral transformation and a second set of feature space representations using a second bilateral transformation;
associating, using the first set of feature space representations obtained from the first base classifier and the second set of feature space representations obtained from the second base classifier, a predicted true label for each element of the set of regions of interest within the unlabeled data, wherein the predicted true label and associated regions are used for co-training the first base classifier and the second base classifier; and
expand, the labeled data from the supplementary unlabeled data and its true label.
6. The system (102) as claimed in claim 5, wherein predicting the true label for the region of interest for the unlabeled data is based on the higher confidence value obtained from the first base classifier and the second base classifier.
7. The system (102) as claimed in claim 5, wherein the first set of features are extracted from the first feature extractor comprising a fast region based convolutional neural network and the second set of features are extracted from the second feature extractor comprising a spatial pyramid pooling network.
8. The system (102) as claimed in claim 5, wherein pre-training the first feature extractor and the second feature extractor using the weighted triplet loss function comprises:
obtaining, for every previous iteration based on the prediction from the pair of base classifiers, corresponding positive image regions of interest instances and corresponding negative image regions of interest instances from the set of region proposals of the unlabeled data;
expanding, with the newly determined region of interest instances from the set of region proposals from the unlabeled data, into the existing data set of the corresponding known positive image region of interest instances and the negative image region of interest instances;
obtaining,
a first factor of the weighted triplet loss function, by determining margins between the two loss sub-factors for the positive image region of interest instance; and
a second factor of the weighted triplet loss function, by determining an absolute value of the cumulative squared distance (L2) between the expanded set of the positive image region of interest instance and the negative image region of interest instance;
computing, an updated value for the weighted triplet loss function, by multiplying the first factor of the weighted triplet loss function and the second factor of the weighted triplet loss function; and
fine-tuning, using the updated weighted triplet loss function value, the first feature extractor and the second feature extractor before providing the input data to the first feature extractor and the second feature extractor for the current region of interest in the current iteration.
| # | Name | Date |
|---|---|---|
| 1 | 201821008840-STATEMENT OF UNDERTAKING (FORM 3) [09-03-2018(online)].pdf | 2018-03-09 |
| 2 | 201821008840-PROVISIONAL SPECIFICATION [09-03-2018(online)].pdf | 2018-03-09 |
| 3 | 201821008840-FORM 1 [09-03-2018(online)].pdf | 2018-03-09 |
| 4 | 201821008840-DRAWINGS [09-03-2018(online)].pdf | 2018-03-09 |
| 5 | 201821008840-Proof of Right (MANDATORY) [17-03-2018(online)].pdf | 2018-03-17 |
| 6 | 201821008840-FORM-26 [26-04-2018(online)].pdf | 2018-04-26 |
| 7 | 201821008840-ORIGINAL UNDER RULE 6 (1A)-FORM 1-210318.pdf | 2018-08-11 |
| 8 | 201821008840-ORIGINAL UR 6( 1A) FORM 26-040518.pdf | 2018-08-14 |
| 9 | 201821008840-FORM 3 [08-03-2019(online)].pdf | 2019-03-08 |
| 10 | 201821008840-FORM 18 [08-03-2019(online)].pdf | 2019-03-08 |
| 11 | 201821008840-ENDORSEMENT BY INVENTORS [08-03-2019(online)].pdf | 2019-03-08 |
| 12 | 201821008840-DRAWING [08-03-2019(online)].pdf | 2019-03-08 |
| 13 | 201821008840-COMPLETE SPECIFICATION [08-03-2019(online)].pdf | 2019-03-08 |
| 14 | Abstract1.jpg | 2019-06-15 |
| 15 | 201821008840-OTHERS [27-07-2021(online)].pdf | 2021-07-27 |
| 16 | 201821008840-FER_SER_REPLY [27-07-2021(online)].pdf | 2021-07-27 |
| 17 | 201821008840-COMPLETE SPECIFICATION [27-07-2021(online)].pdf | 2021-07-27 |
| 18 | 201821008840-CLAIMS [27-07-2021(online)].pdf | 2021-07-27 |
| 19 | 201821008840-FER.pdf | 2021-10-18 |
| 20 | 201821008840-US(14)-HearingNotice-(HearingDate-20-11-2025).pdf | 2025-10-29 |
| 21 | 201821008840-FORM-26 [10-11-2025(online)].pdf | 2025-11-10 |
| 22 | 201821008840-FORM-26 [10-11-2025(online)]-1.pdf | 2025-11-10 |
| 23 | 201821008840-Correspondence to notify the Controller [10-11-2025(online)].pdf | 2025-11-10 |
| 24 | 201821008840-US(14)-ExtendedHearingNotice-(HearingDate-28-11-2025)-1200.pdf | 2025-11-18 |