Abstract: The present disclosure provides a system 208 and a method 1000 for training a deep learning model for semantic segmentation of building footprints. The method includes receiving 1002 aerial images and ground truth-masked images of the building footprints and determining 1004 a plurality of ground truth polygon masks in a ground truth-masked image and a plurality of boundary polygon masks in an aerial image using an image processing technique. Further, the method includes determining 1006 that at least one ground truth polygon mask is closely matched with each boundary polygon mask and comparing 1008 the at least one closely matched ground truth polygon mask with each boundary polygon mask. Furthermore, the method includes determining 1010 a standard loss based on the comparison and training 1012 the deep learning model for semantic segmentation of the building footprints based on the standard loss.
RESERVATION OF RIGHTS
[001] A portion of the disclosure of this patent document contains material which is subject to intellectual property rights such as, but not limited to, copyright, design, trademark, integrated circuit (IC) layout design, and/or trade dress protection, belonging to Jio Platforms Limited (JPL) or its affiliates (hereinafter referred to as the owner). The owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all rights whatsoever. All rights to such intellectual property are fully reserved by the owner.
TECHNICAL FIELD
[002] The present disclosure relates to the field of training a deep learning model, and specifically to a system and a method for utilizing a large dataset of satellite imagery with shifted ground truth data for training the deep learning model for semantic segmentation of building footprints.
BACKGROUND
[003] The following description of related art is intended to provide background information pertaining to the field of the disclosure. This section may include certain aspects of the art that may be related to various features of the present disclosure. However, it should be appreciated that this section is to be used only to enhance the understanding of the reader with respect to the present disclosure, and not as an admission of prior art.
[004] In recent years, there has been a sharp increase in the amount of satellite imagery available for the building detection problem. However, using this huge volume of data for training a deep learning model is challenging, as satellite images and their corresponding ground truth label masks often come from different temporal instances, i.e., the ground truth data recorded in geospatial databases was captured from one satellite viewpoint in orbit, while the current satellite imagery is obtained from a different viewpoint due to movement of the satellite in its orbit. As a result, there is a noticeable shift in alignment between the most recent satellite imagery and the stored ground truth mask of an earlier time. In other cases, the ground truth mask includes only a subset of the buildings visible at a previous time and does not account for a new building that has recently been constructed in the area.
[005] Patent document US20220156526A1 describes a computer-implemented method. The method may include collecting a set of labels that label polygons within a training set of images as architectural structures. The method may also include creating a set of noisy labels with a predetermined degree of noise by distorting boundaries of a number of the polygons within the training set of images. Additionally, the method may include simultaneously training two neural networks by applying a co-teaching method to learn from the set of noisy labels. The method may also include extracting a preferential list of training data based on the two trained neural networks. Furthermore, the method may include training a machine learning model with the preferential list of training data. Finally, the method may include identifying one or more building footprints in a target image using the trained machine learning model. Various other methods, systems, and computer-readable media are also disclosed.
[006] Another patent document CN113516135A describes a method for remote sensing image building extraction and contour optimization based on deep learning, belonging to the field of environmental measurement. The approach applies semantic segmentation to building extraction, and enhances building contour optimization by incorporating Hausdorff distance. The method introduces a residual structure, a convolution attention module, and pyramid pooling into a Unet model. This utilization leverages the feature extraction capabilities of the residual module, the spatial and channel information balancing capabilities of the convolution attention module, and the multi-scale scene analysis features of the pyramid pooling module. This leads to the establishment of a PRCUnet model that simultaneously considers semantic and detail information, addressing the Unet's deficiency in small target detection. According to this method, the IoU and recall rate for the dataset both achieve 85% or higher, with significantly superior precision compared to a Unet model. The precision of the extracted buildings is higher, and the optimized building boundaries closely approximate the real building contours.
[007] Yet another patent document US10592765B2 describes various methods and systems for generating information from images of a building. In one example, two-dimensional (2D) building and/or building element information may be derived from overlapping 2D images of the building. Three-dimensional (3D) building and building element information may be generated based on the 2D building and/or building element information. The 2D image information may be combined with 3D information about the building and/or building elements to generate projective geometry information. Clustered 3D information may be obtained by partitioning and grouping 3D data points. An information set associated with the building and/or at least one building element may then be created.
[008] The conventional methods and systems of the above-cited references have limitations: using an excessive or insufficient degree of noise may negatively affect the training process, and combining 2D and 3D information may introduce errors or inconsistencies into the final projective geometry information.
[009] Therefore, there is a need in the art for an improved system and method that directly utilize a large dataset of satellite imagery with shifted ground truth data for training a convolutional neural network (CNN) model for semantic segmentation of building footprints.
OBJECTS OF THE PRESENT DISCLOSURE
[0010] It is an object of the present disclosure to provide a weakly supervised approach for learning a robust and accurate building segmentation model directly on misaligned ground truth data.
[0011] It is an object of the present disclosure to provide a system and a method to directly utilize a large dataset of satellite imagery with shifted ground truth data for training a convolutional neural network (CNN) model for semantic segmentation of building footprints.
[0012] It is an object of the present disclosure to provide a system and a method for determining a distance transformation between a ground truth polygon mask and a predicted boundary polygon mask.
[0013] It is an object of the present disclosure to provide a system and a method for comparing a closely matched ground truth polygon mask with a predicted boundary polygon mask to determine a standard loss for back propagation.
SUMMARY
[0014] This section is provided to introduce certain objects and aspects of the present disclosure in a simplified form that are further described below in the detailed description. This summary is not intended to identify the key features or the scope of the claimed subject matter.
[0015] In an aspect, the present disclosure relates to a system for training a deep learning model for semantic segmentation of building footprints. The system includes one or more processors and a memory operatively coupled to the one or more processors, wherein the memory includes processor-executable instructions, which on execution, cause the one or more processors to receive a plurality of aerial images and a plurality of ground truth-masked images of the building footprints and determine a plurality of ground truth polygon masks in at least one ground truth-masked image and a plurality of boundary polygon masks in at least one aerial image using an image processing technique. Further, the one or more processors are to determine that at least one ground truth polygon mask is closely matched with each boundary polygon mask based on the determination, and compare the at least one closely matched ground truth polygon mask with said each boundary polygon mask. Further, the one or more processors are to determine a standard loss based on the comparison and train the deep learning model for semantic segmentation of the building footprints based on the standard loss.
[0016] In an embodiment, the one or more processors may correlate the at least one aerial image and the at least one ground truth-masked image to predict a binary segmentation mask of the at least one aerial image and determine the plurality of boundary polygon masks in the at least one predicted binary segmentation mask of the at least one aerial image using the image processing technique based on the prediction.
[0017] In an embodiment, the one or more processors may compare said at least one ground truth polygon mask with said each boundary polygon mask and determine a distance transformation between said at least one ground truth polygon mask and said each boundary polygon mask based on the comparison.
[0018] In an embodiment, the distance transformation replaces a value of each pixel in said each boundary polygon mask with a distance to a nearest boundary pixel within said at least one ground truth polygon mask.
[0019] In an embodiment, the one or more processors may determine an error of said each boundary polygon mask based on the distance transformation and compare the error of said each boundary polygon mask with said at least one ground truth polygon mask using the distance transformation. Further, the one or more processors may determine a lowest error based on the comparison and determine the at least one closely matched ground truth polygon mask based on the lowest error. Furthermore, the one or more processors may assign the at least one closely matched ground truth polygon mask to said each boundary polygon mask.
[0020] In an embodiment, the one or more processors may determine an overlap range between said each boundary polygon mask and said at least one ground truth polygon mask to determine the error.
[0021] In an embodiment, the one or more processors may apply a transformation matrix between said each boundary polygon mask and the at least one closely matched ground truth polygon mask and determine the lowest error of the at least one closely matched ground truth polygon mask corresponding to said each boundary polygon mask.
[0022] In an embodiment, the one or more processors may select the lowest error of the at least one closely matched ground truth polygon mask corresponding to said each boundary polygon mask to determine the standard loss.
[0023] In an embodiment, the one or more processors may back propagate the determined standard loss to the deep learning model for training.
[0024] In another aspect, the present disclosure relates to a method for training a deep learning model for semantic segmentation of building footprints. The method includes receiving, by one or more processors, a plurality of aerial images and a plurality of ground truth-masked images of the building footprints and determining, by the one or more processors, a plurality of ground truth polygon masks in at least one ground truth-masked image and a plurality of boundary polygon masks in at least one aerial image using an image processing technique. Further, the method includes determining, by the one or more processors, that at least one ground truth polygon mask is closely matched with each boundary polygon mask based on the determination, and comparing, by the one or more processors, the at least one closely matched ground truth polygon mask with said each boundary polygon mask. Furthermore, the method includes determining, by the one or more processors, a standard loss based on the comparison and training the deep learning model for semantic segmentation of the building footprints based on the standard loss.
[0025] In an embodiment, for determining, by the one or more processors, the plurality of boundary polygon masks in the at least one aerial image, the method may include correlating, by the one or more processors, the at least one aerial image and the at least one ground truth-masked image and predicting, by the one or more processors, a binary segmentation mask of the at least one aerial image based on the correlation. Further, the method may include determining, by the one or more processors, the plurality of boundary polygon masks in the at least one predicted binary segmentation mask of the at least one aerial image using the image processing technique based on the prediction.
[0026] In an embodiment, for determining, by the one or more processors, that the at least one ground truth polygon mask is closely matched with said each boundary polygon mask based on the determination, the method may include comparing, by the one or more processors, said at least one ground truth polygon mask with said each boundary polygon mask and determining, by the one or more processors, a distance transformation between said at least one ground truth polygon mask and said each boundary polygon mask based on the comparison.
[0027] In an embodiment, the distance transformation replaces a value of each pixel in said each boundary polygon mask with a distance to a nearest boundary pixel within said at least one ground truth polygon mask.
[0028] In an embodiment, for determining, by the one or more processors, that the at least one ground truth polygon mask is closely matched with said each boundary polygon mask, the method may include determining, by the one or more processors, an error of said each boundary polygon mask based on the distance transformation and comparing, by the one or more processors, the error of said each boundary polygon mask with said at least one ground truth polygon mask using the distance transformation. Further, the method may include determining, by the one or more processors, a lowest error based on the comparison and determining, by the one or more processors, the at least one closely matched ground truth polygon mask based on the lowest error. Furthermore, the method may include assigning, by the one or more processors, the at least one closely matched ground truth polygon mask to said each boundary polygon mask.
[0029] In an embodiment, for determining, by the one or more processors, the error of said each boundary polygon mask, the method may include determining, by the one or more processors, an overlap range between said each boundary polygon mask and said at least one ground truth polygon mask to determine the error.
[0030] In an embodiment, for determining, by the one or more processors, the lowest error, the method may include applying, by the one or more processors, a transformation matrix between said each boundary polygon mask and the at least one closely matched ground truth polygon mask and determining, by the one or more processors, the lowest error of the at least one closely matched ground truth polygon mask corresponding to said each boundary polygon mask.
[0031] In an embodiment, for determining, by the one or more processors, the standard loss based on the comparison, the method may include selecting, by the one or more processors, the lowest error of the at least one closely matched ground truth polygon mask corresponding to said each boundary polygon mask.
[0032] In an embodiment, for training, by the one or more processors, the deep learning model for semantic segmentation of the building footprints based on the standard loss, the method may include back propagating, by the one or more processors, the determined standard loss to the deep learning model.
BRIEF DESCRIPTION OF THE DRAWINGS
[0033] In the figures, similar components and/or features may have the same reference label. Further, various components of the same type may be distinguished by following the reference label with a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description is applicable to any one of the similar components having the same first reference label irrespective of the second reference label.
[0034] The diagrams are for illustration only and thus do not limit the present disclosure, wherein:
[0035] FIG. 1A and FIG. 1B illustrate schematic representations depicting a shift in alignment between the most recent satellite imagery and the stored ground truth mask of an earlier time, in accordance with the prior art.
[0036] FIG. 2A illustrates an exemplary network architecture 200A in which or with which embodiments of the present disclosure may be implemented.
[0037] FIG. 2B illustrates an example block diagram 200B of a system 208 for training a deep learning model using satellite imagery along with misaligned ground truth data, in accordance with an embodiment of the present disclosure.
[0038] FIG. 3 illustrates an example representation of a system 208 for providing a weakly supervised approach for learning a robust and accurate building segmentation model directly on the misaligned ground truth data 300, in accordance with an embodiment of the present disclosure.
[0039] FIG. 4 illustrates an example block diagram of a matching ground truth polygon mask selection mechanism and a matching ground truth polygon mask alignment mechanism 400, in accordance with an embodiment of the present disclosure.
[0040] FIG. 5 illustrates an example representation of a ground truth polygon mask 500, in accordance with an embodiment of the present disclosure.
[0041] FIG. 6 illustrates an example representation of an inferred output boundary polygon 600, in accordance with an embodiment of the present disclosure.
[0042] FIG. 7 illustrates an example representation of a selected polygon image and a distance transform grayscale image 700, in accordance with an embodiment of the present disclosure.
[0043] FIG. 8 illustrates an example representation of a best-matching ground truth polygon mask selection 800, in accordance with an embodiment of the disclosure.
[0044] FIG. 9 illustrates an example representation of a best-matching ground truth polygon mask refinement 900, in accordance with an embodiment of the disclosure.
[0045] FIG. 10 illustrates an exemplary flow chart for implementing a method 1000 for training a deep learning model using satellite imagery along with misaligned ground truth data, in accordance with an embodiment of the disclosure.
[0046] FIG. 11 illustrates an exemplary computer system 1100 in which or with which embodiments of the present disclosure may be implemented.
[0047] The foregoing shall be more apparent from the following more detailed description of the disclosure.
DETAILED DESCRIPTION
[0048] In the following description, for the purposes of explanation, various specific details are set forth in order to provide a thorough understanding of embodiments of the present disclosure. It will be apparent, however, that embodiments of the present disclosure may be practiced without these specific details. Several features described hereafter can each be used independently of one another or with any combination of other features. An individual feature may not address all of the problems discussed above or might address only some of the problems discussed above. Some of the problems discussed above might not be fully addressed by any of the features described herein.
[0049] The ensuing description provides exemplary embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of the exemplary embodiments will provide those skilled in the art with an enabling description for implementing an exemplary embodiment. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth.
[0050] Specific details are given in the following description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments.
[0051] Also, it is noted that individual embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination can correspond to a return of the function to the calling function or the main function.
[0052] The word “exemplary” and/or “demonstrative” is used herein to mean serving as an example, instance, or illustration. For the avoidance of doubt, the subject matter disclosed herein is not limited by such examples. In addition, any aspect or design described herein as “exemplary” and/or “demonstrative” is not necessarily to be construed as preferred or advantageous over other aspects or designs, nor is it meant to preclude equivalent exemplary structures and techniques known to those of ordinary skill in the art. Furthermore, to the extent that the terms “includes,” “has,” “contains,” and other similar words are used in either the detailed description or the claims, such terms are intended to be inclusive—in a manner similar to the term “comprising” as an open transition word—without precluding any additional or other elements.
[0053] Reference throughout this specification to “one embodiment” or “an embodiment” or “an instance” or “one instance” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
[0054] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
[0055] Lately, there has been a sharp increase in the amount of satellite imagery data available for solving the segmentation problem. However, using this data for training a deep learning model has been a challenging task, as the satellite images and the corresponding ground truth label masks often come from different temporal instances. As a result, there is a noticeable shift in alignment between the most recent satellite imagery and the stored ground truth mask of an earlier time. In addition, the available ground truth mask comprises a subset of the buildings visible at a previous/earlier time and does not account for recently created new buildings in the area.
[0056] FIGs. 1A and 1B illustrate schematic representations 100A and 100B depicting a shift in alignment between the most recent satellite imagery and the stored ground truth mask of an earlier time, in accordance with the prior art.
[0057] In a conventional neural network-based semantic segmentation approach, learning depends upon computing a difference between the segmentation mask resulting from a network inference and the ground truth segmentation mask, and propagating the difference back to the layers of the network to adjust their parameters. For example, a linear combination of a distance-weighted binary cross entropy loss and a dice loss is determined to compute the difference between the ground truth and the segmentation mask, i.e.,
Loss = (1 - a) * L_DBCE + a * L_Dice
where
L_DBCE = -(beta * y * log(yhat) + (1 - y) * log(1 - yhat)), where y is the ground truth label and yhat is the predicted label, and
L_Dice = 1 - (2 * True Positives / (2 * True Positives + False Negatives + False Positives)).
The above loss functions work as a reliable proxy for prediction efficiency, measured using certain standard metrics for building footprint extraction, such as mIoU (mean Intersection over Union), which measures the similarity between bounding boxes, and accuracy, which is defined as the percentage of correctly classified pixels.
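By way of a non-limiting illustration, the above loss may be sketched in Python/NumPy as follows. The function and parameter names (weighted_bce_dice_loss, alpha, beta) are illustrative assumptions, and the per-pixel distance weighting of the binary cross entropy term is folded into a single class weight beta for brevity:

```python
import numpy as np

def weighted_bce_dice_loss(y, y_hat, alpha=0.5, beta=1.0, eps=1e-7):
    """Linear combination of weighted binary cross entropy and dice loss.

    y     : ground truth mask with values in {0, 1}
    y_hat : predicted probabilities with values in (0, 1)
    alpha : mixing weight 'a' between the two loss terms
    beta  : weight applied to the positive (building) pixels
    """
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    l_dbce = -np.mean(beta * y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    tp = np.sum(y * y_hat)            # soft true positives
    fp = np.sum((1 - y) * y_hat)      # soft false positives
    fn = np.sum(y * (1 - y_hat))      # soft false negatives
    l_dice = 1.0 - (2 * tp) / (2 * tp + fn + fp + eps)
    return (1 - alpha) * l_dbce + alpha * l_dice
```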
[0058] The foregoing shows that the shape and position of the ground truth play an important role in learning a robust representation of the given data. Misaligned ground truth thus prevents learning a robust representation on the shifted satellite image data. To mitigate this issue, the proposed system and method provide a weakly supervised learning approach to directly utilise this large dataset of satellite imagery with misaligned ground truth data for training the CNN model for semantic segmentation of building footprints. Various embodiments of the present disclosure will be explained in detail with reference to FIGs. 2A-11.
[0059] The proposed system and method disclose automated image understanding and processing of remotely sensed satellite imagery. Advanced techniques of computer vision, image processing, statistical pattern classification, and machine learning are applied to classify and segment features of interest, such as buildings, in the satellite imagery. The proposed system and method enable the development of robust and very accurate building segmentation models even in the absence of perfectly aligned ground truth segmentation data.
[0060] In accordance with embodiments of the present disclosure, the proposed system and method perform segmentation of buildings from aerial (satellite/drone) images. The availability of high-resolution remote sensing data has opened up the possibility of interesting applications, such as per-pixel classification of individual objects in higher detail. Using a Convolutional Neural Network (CNN) enables efficient and smart segmentation and classification of images.
[0061] In an embodiment, the proposed system and method calculate a shift-aware loss between the predicted and ground truth-masked images and propagate it back to the network, instead of calculating a standard loss (such as a linear combination of dice and weighted binary cross entropy loss) between the predicted masks and the ground truth-masked images.
[0062] In an embodiment, the shift-aware loss may be calculated in two steps. First, by performing a matching ground truth selection, where, for each predicted polygon from a segmentation network, a distance transform with respect to the boundary of the polygon may be computed and used as a shape similarity measure to identify the best matching ground truth polygon mask. Second, by performing a ground truth alignment correction, where a transformation matrix (rotation and translation) may be approximated between each predicted polygon and its best matching ground truth polygon mask. This may be achieved in an iterative manner by repeatedly perturbing (random rotation and translation in a controlled range) the ground truth and comparing the shape similarity with respect to the distance transform of the predicted polygon as in the first step. The minimum distance across perturbations may be taken as the best approximation of the transformation matrix.
[0063] In an embodiment, the best-obtained transformation of the ground truth for each predicted polygon may be applied to calculate the standard loss (e.g., a linear combination of dice and weighted binary cross entropy loss).
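A high-level, non-limiting sketch of the resulting two-step shift-aware loss is given below. It is an illustrative orchestration only: the helpers extract_boundary_polygons, boundary_distance_transform, shape_similarity_error, and best_alignment are sketched alongside FIGs. 5-9 in the detailed description below, and weighted_bce_dice_loss is the loss sketch given above.

```python
import cv2
import numpy as np

def rasterize(polygon, shape):
    """Fill a boundary polygon into a binary mask."""
    out = np.zeros(shape, dtype=np.uint8)
    cv2.fillPoly(out, [polygon], 1)
    return out

def shift_aware_loss(pred_probs, gt_mask, threshold=0.5):
    """Select and align the best matching ground truth polygon for each
    predicted polygon, then compute the standard loss on the result."""
    pred_polys = extract_boundary_polygons(pred_probs > threshold)
    gt_polys = extract_boundary_polygons(gt_mask)
    aligned_gt = np.zeros(gt_mask.shape, dtype=np.uint8)
    for poly in pred_polys:
        if not gt_polys:
            break
        dt = boundary_distance_transform(poly, gt_mask.shape)
        # Step 1: matching ground truth selection
        best_gt = min(gt_polys, key=lambda g: shape_similarity_error(g, dt))
        # Step 2: ground truth alignment correction
        aligned_gt |= rasterize(best_alignment(best_gt, dt), gt_mask.shape)
    return weighted_bce_dice_loss(aligned_gt.astype(np.float32), pred_probs)
```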
[0064] The terms “boundary polygon mask” and “predicted boundary polygon mask” are used interchangeably throughout the specification.
[0065] FIG. 2A illustrates an exemplary network architecture 200A in which or with which embodiments of the present disclosure may be implemented.
[0066] As illustrated in FIG. 2A, by way of an example and not by limitation, the exemplary network architecture 200A may include a plurality of computing devices 204-1, 204-2…204-N, which may be individually referred to as the computing device 204 and collectively referred to as the computing devices 204.
[0067] In an embodiment, the computing device 204 may include smart devices operating in a smart environment, for example, an Internet of Things (IoT) system. In such an embodiment, the computing device 204 may include, but is not limited to, smart phones, smart watches, smart sensors (e.g., mechanical, thermal, electrical, magnetic, etc.), networked appliances, networked peripheral devices, networked lighting system, communication devices, networked vehicle accessories, networked vehicular devices, smart accessories, tablets, smart television (TV), computers, smart security system, smart home system, other devices for monitoring or interacting with or for the users and/or entities, or any combination thereof.
[0068] A person of ordinary skill in the art will appreciate that the computing device or a user equipment 204 may include, but is not limited to, intelligent, multi-sensing, network-connected devices, that may integrate seamlessly with each other and/or with a central server or a cloud-computing system or any other device that is network-connected.
[0069] In an embodiment, the computing device 204 may include, but is not limited to, a handheld wireless communication device (e.g., a mobile phone, a smart phone, a phablet device, and so on), a wearable computer device (e.g., a head-mounted display computer device, a head-mounted camera device, a wristwatch computer device, and so on), a Global Positioning System (GPS) device, a laptop computer, a tablet computer, or another type of portable computer, a media playing device, a portable gaming system, and/or any other type of computer device with wireless communication capabilities, and the like. In an embodiment, the computing device 204 may include, but is not limited to, any electrical, electronic, or electro-mechanical equipment, or a combination of one or more of the above devices, such as virtual reality (VR) devices, augmented reality (AR) devices, a laptop, a general-purpose computer, a desktop, a personal digital assistant, a tablet computer, a mainframe computer, or any other computing device. The computing device 204 may include one or more in-built or externally coupled accessories including, but not limited to, a visual aid device such as a camera, an audio aid, a microphone, a keyboard, and input devices for receiving input from the user or the entity, such as a touch pad, a touch-enabled screen, an electronic pen, and the like. A person of ordinary skill in the art may appreciate that the computing device 204 may not be restricted to the mentioned devices and various other devices may be used.
[0070] In an exemplary embodiment, the computing device/user equipment 204 may communicate with the system 208 through a network 206. The network 206 may include, by way of example but not limitation, at least a portion of one or more networks having one or more nodes that transmit, receive, forward, generate, buffer, store, route, switch, process, or a combination thereof, etc. one or more messages, packets, signals, waves, voltage or current levels, some combination thereof, or so forth. The network 206 may include, by way of example but not limitation, one or more of: a wireless network, a wired network, an internet, an intranet, a public network, a private network, a packet-switched network, a circuit-switched network, an ad hoc network, an infrastructure network, a public-switched telephone network (PSTN), a cable network, a cellular network, a satellite network, a fiber optic network, some combination thereof.
[0071] In an embodiment, the system 208 may receive a plurality of aerial images and a plurality of ground truth-masked images of the building footprints and determine a plurality of ground truth polygon masks in at least one ground truth-masked image and a plurality of boundary polygon masks in at least one aerial image using an image processing technique. Further, the system 208 may determine that at least one ground truth polygon mask is closely matched with each boundary polygon mask based on the determination and compare the at least one closely matched ground truth polygon mask with said each boundary polygon mask. Furthermore, the system 208 may determine a standard loss based on the comparison and train the deep learning model for semantic segmentation of the building footprints based on the standard loss.
[0072] Although FIG. 2A shows exemplary components of the network architecture 200A, in other embodiments, the network architecture 200A may include fewer components, different components, differently arranged components, or additional functional components than depicted in FIG. 2A. Additionally, or alternatively, one or more components of the network architecture 200A may perform functions described as being performed by one or more other components of the network architecture 200A.
[0073] FIG. 2B illustrates an example block diagram 200B of a system 208 for training a deep learning model using satellite imagery along with misaligned ground truth data, in accordance with an embodiment of the present disclosure.
[0074] Referring to FIG. 2B, the system 208 may include one or more processors 210, a memory 212, and an interface(s) 214. The one or more processors 210 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that manipulate data based on operational instructions. Among other capabilities, the one or more processor(s) 210 may be configured to fetch and execute computer-readable instructions stored in the memory 212 of the system 208. The memory 212 may store one or more computer-readable instructions or routines, which may be fetched and executed to create or share the data units over a network service. The memory 212 may include any non-transitory storage device including, for example, volatile memory such as Random-Access Memory (RAM), or non-volatile memory such as Erasable Programmable Read-Only Memory (EPROM), flash memory, and the like.
[0075] The interface(s) 214 may comprise a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, and the like. The interface(s) 214 may facilitate communication of the system 208 with various devices coupled to it. The interface(s) 214 may also provide a communication pathway for one or more components of the system 208. Examples of such components include, but are not limited to, processing engine(s) 216 and a database 218, where the database 218 may include data that is either stored or generated as a result of functionalities implemented by any of the components of the processing engine(s) 216.
[0076] In an embodiment, the processing engine(s) 216 may be implemented as a combination of hardware and programming (for example, programmable instructions) to implement one or more functionalities of the processing engine(s) 216. In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the processing engine(s) 216 may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the one or more processor(s) 210 may comprise a processing resource (for example, one or more processors), to execute such instructions. In the present examples, the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the processing engine(s) 216. In such examples, the system 208 may comprise the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the system 208 and the processing resource. In other examples, the processing engine(s) 216 may be implemented by an electronic circuitry. The processing engine(s) 216 may include a correlation module 220, a boundary polygon mask determination module 222, a distance transform determination module 224, a closely matching polygon selection module 226, an error determination module 228, and other module(s) 230.
[0077] The system 208 may receive aerial images and ground truth-masked images, where the aerial images may be captured by, but not limited to, satellites, drones/Unmanned Aerial Vehicles (UAVs), airplanes, parachutes, space shuttles, and the like. The correlation module 220 may correlate an aerial image and a ground truth-masked image and predict a binary segmentation mask. The boundary polygon mask determination module 222 may determine ground truth polygon masks in the ground truth-masked image and boundary polygon masks in the aerial images using an image processing technique. The distance transform determination module 224 may compare the ground truth polygon masks with the boundary polygon masks to determine the distance transformation between the ground truth polygon masks and the boundary polygon masks. The distance transformation may replace the value of each pixel in the boundary polygon masks with its distance to the nearest boundary pixel within the ground truth polygon masks.
[0078] Once the distance is measured between the ground truth polygon masks and the boundary polygon masks, the closely matching polygon selection module 226 may determine an overlap range between the boundary polygon masks and the ground truth polygon masks to determine an error of the boundary polygon mask based on the overlap. Also, the closely matching polygon selection module 226 may compare the error of the boundary polygon masks with the ground truth polygon masks using the distance transformation. Once the comparison is done, the closely matching polygon selection module 226 may determine a lowest error between the ground truth polygon masks and the boundary polygon masks. Once the lowest error is determined, the closely matching polygon selection module 226 may determine the closely matched ground truth polygon mask for the boundary polygon masks and assign the closely matched ground truth polygon mask to the boundary polygon mask.
[0079] The error determination module 228 may compare the closely matched ground truth polygon mask with the predicted boundary polygon masks to determine a standard loss. It may be appreciated that the standard loss may be interchangeably referred to as a shift-aware loss. Once the standard loss is determined, the one or more processors 210 may back propagate the standard loss to the deep learning model. The other module(s) 230 may implement functionalities that supplement applications/functions performed by the processing engine(s) 216.
[0080] FIG. 3 illustrates an example representation 300 of a system 208 for providing a weakly supervised approach for learning a robust and accurate building segmentation model directly on the misaligned ground truth data, in accordance with an embodiment of the disclosure.
[0081] With respect to FIG. 3, a satellite image may be encoded (by an encoder) and decoded (by a decoder). The decoded satellite image may be forward passed to produce an inference output. Using the ground truth-masked images and the inference output, a matching ground truth polygon mask selection may be performed, as represented in block 302. Further, as represented in block 304, a matching ground truth polygon mask alignment may be performed to determine a shift-aware loss. The shift-aware loss may be calculated and back-propagated to the decoder. Referring to FIG. 3, a building footprint segmentation model may be trained with the training data. During training, a variant of a deep learning segmentation model (such as an encoder-decoder) may take a satellite image ‘I’ and a ground truth mask ‘M,’ and predict a binary segmentation mask ‘P’ of the satellite image.
[0082] FIG. 4 illustrates an example block diagram 400 of a matching ground truth polygon mask selection mechanism and a matching ground truth polygon mask alignment mechanism, in accordance with an embodiment of the disclosure.
[0083] Referring to FIG. 4, the matching ground truth polygon mask selection 302 may include certain procedures, such as extracting boundaries from the predicted boundary polygon masks and the ground truth polygon masks, as represented in block 402. At block 404, the method may include determining a distance transform for each predicted boundary polygon mask. At block 406, the method may include determining a shape similarity error of each ground truth boundary polygon with respect to the predicted boundary polygon using the distance transform. At block 408, the method may include assigning the ground truth polygon mask with the minimal shape similarity error to the predicted boundary polygon.
[0084] Referring to FIG. 4, the matching ground truth polygon mask alignment 304 may include certain procedures, such as iteratively perturbing the best matching ground truth polygon mask and computing the shape similarity error with respect to the predicted boundary polygon, as represented in block 410. At block 412, the method may include applying the transformation with the minimal shape similarity error to the ground truth polygon mask and computing a standard loss.
[0085] With respect to FIG. 4, the training data may be used to train a building footprint extraction model for a diverse set of areas. An instance of the training data may include a patch of a satellite image and corresponding ground truth polygon masks. The satellite image corresponding to an area may be obtained/purchased/downloaded from multiple satellite imagery vendors, while the ground truth-masked images for building footprints are extracted from the geospatial databases. The ground truth-masked images could have been a result of a manual annotation using a geographic information system (GIS) or a historical run of a building footprint model prediction on satellite imagery available at that time from a particular vendor. The satellite image tile of the area on which the ground truth-masked images were annotated/obtained may not be the same as the satellite image tile of that particular area in the training data due to differences in satellite imagery vendors' hardware, orbits, viewpoints, weather, and resolution, i.e., the ground truth-masked images may be misaligned with respect to the training satellite images.
[0086] FIGs. 5 and 6 illustrate example representations of a ground truth polygon mask 500 and an inferred output boundary polygon 600, in accordance with an embodiment of the disclosure.
[0087] Referring to FIGs. 5 and 6, a boundary polygon mask for both the ground truth polygon mask and the predicted binary segmentation mask may be determined using classical image processing techniques, as sketched below.
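The disclosure does not mandate a particular image processing technique; as one plausible sketch, the boundary polygons may be extracted with OpenCV contour tracing (the helper name and the simplification tolerance are illustrative assumptions):

```python
import cv2
import numpy as np

def extract_boundary_polygons(binary_mask):
    """Extract boundary polygons from a binary mask via contour tracing."""
    mask = (np.asarray(binary_mask) > 0).astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Optionally simplify each traced contour into a polygon.
    return [cv2.approxPolyDP(c, 1.0, True) for c in contours]
```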
[0088] In an embodiment, for each boundary polygon mask in the predicted binary segmentation mask, a distance transformation of the polygon image is determined, where the value of each pixel in the image may be replaced by its distance to the nearest boundary pixel of the selected boundary polygon mask.
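This step may be sketched as follows. OpenCV's distanceTransform measures the distance of each pixel to the nearest zero pixel, so the rendered boundary image is inverted first; the helper name and the choice of the Euclidean (L2) distance are illustrative assumptions:

```python
def boundary_distance_transform(polygon, image_shape):
    """Per-pixel distance to the nearest boundary pixel of the polygon."""
    boundary = np.zeros(image_shape, dtype=np.uint8)
    cv2.polylines(boundary, [polygon], isClosed=True, color=255, thickness=1)
    inverted = cv2.bitwise_not(boundary)  # boundary pixels become 0
    return cv2.distanceTransform(inverted, cv2.DIST_L2, 3)
```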
[0089] FIG. 7 illustrates an example representation of a selected polygon image and a distance transform grayscale image 700, in accordance with an embodiment of the disclosure. As illustrated, the darker a pixel in the distance transform image, the closer it is to a polygon boundary.
[0090] In another embodiment, for each predicted polygon, a best matching ground truth polygon mask may be identified. This may be done by comparing the shape similarity error of all the ground truth boundary polygons to the predicted polygon using the previously determined distance transform, and assigning the ground truth polygon with the lowest error. The shape similarity error between a ground truth polygon mask and a predicted boundary polygon may be given by the average of the pixel values (i.e., distances) in the distance transform image that overlap with the ground truth boundary polygon pixel locations.
[0091] In some embodiments, for a ground truth polygon that is completely aligned with the predicted polygon boundaries, the shape similarity error may be zero. For example, for the centre polygon, most of the overlapping pixels in the distance transform are black, i.e., a smaller distance yields a smaller error value.
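In code, this shape similarity error may be sketched as the mean of the distance transform sampled at the ground truth boundary pixels (the function name is an illustrative assumption):

```python
def shape_similarity_error(gt_polygon, dist_transform):
    """Average distance-transform value at the ground truth boundary
    pixels; zero when the two boundaries align exactly."""
    gt_boundary = np.zeros(dist_transform.shape, dtype=np.uint8)
    cv2.polylines(gt_boundary, [gt_polygon], isClosed=True,
                  color=255, thickness=1)
    return float(dist_transform[gt_boundary > 0].mean())
```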
[0092] FIG. 8 illustrates a best-matching ground truth polygon mask selection 800, in accordance with an embodiment of the disclosure.
[0093] In an embodiment, after identifying the best matching ground truth polygon mask for each predicted boundary polygon mask, the following steps may be performed iteratively: (i) perturb (apply a random rotation and translation in a small range) the best matching ground truth boundary polygon, and (ii) determine the shape similarity error with respect to the predicted boundary polygon as in the previous step, to approximate a transformation matrix that results in the minimum shape similarity error between the predicted boundary polygon and the matching ground truth boundary polygon.
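One possible sketch of this iterative refinement is given below; the iteration count and perturbation ranges are illustrative assumptions, and for simplicity the best perturbed polygon (i.e., the result of applying the approximated transformation matrix) is returned directly:

```python
def best_alignment(gt_polygon, dist_transform, n_iters=100,
                   max_shift=10.0, max_angle=5.0, seed=0):
    """Randomly perturb the ground truth polygon and keep the pose with
    the minimum shape similarity error."""
    rng = np.random.default_rng(seed)
    pts = gt_polygon.reshape(-1, 1, 2).astype(np.float32)
    centre = pts.reshape(-1, 2).mean(axis=0)
    best_err = shape_similarity_error(gt_polygon, dist_transform)
    best_poly = gt_polygon
    for _ in range(n_iters):
        angle = rng.uniform(-max_angle, max_angle)            # degrees
        dx, dy = rng.uniform(-max_shift, max_shift, size=2)   # pixels
        m = cv2.getRotationMatrix2D((float(centre[0]), float(centre[1])),
                                    angle, 1.0)               # 2x3 affine
        m[:, 2] += (dx, dy)                                   # add translation
        moved = cv2.transform(pts, m).astype(np.int32)
        err = shape_similarity_error(moved, dist_transform)
        if err < best_err:
            best_err, best_poly = err, moved
    return best_poly
```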
[0094] FIG. 9 illustrates a best matching ground truth polygon mask refinement 900, in accordance with an embodiment of the disclosure.
[0095] In another embodiment, the best-obtained transformation may be applied to the matching ground truth polygon mask for each boundary polygon mask. Further, a standard loss (such as a linear combination of dice and weighted binary cross entropy loss) may be determined and back-propagated through the deep learning network model.
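As an illustrative sketch of this final step, a PyTorch-style training iteration may look as follows; model, optimizer, loader, and the hypothetical build_aligned_target helper (the polygon matching and alignment of FIGs. 4-9, applied outside the autograd graph) are assumed to exist:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-7):
    tp = (pred * target).sum()
    # 2*TP / (2*TP + FN + FP) == 2*TP / (pred.sum() + target.sum())
    return 1 - (2 * tp) / (pred.sum() + target.sum() + eps)

alpha = 0.5
for image, gt_mask in loader:
    pred = torch.sigmoid(model(image))       # predicted binary segmentation
    with torch.no_grad():                    # alignment is not differentiated
        target = build_aligned_target(pred, gt_mask)
    loss = (1 - alpha) * F.binary_cross_entropy(pred, target) \
           + alpha * dice_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()                          # back propagate the standard loss
    optimizer.step()
```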
[0096] FIG. 10 illustrates an exemplary flow chart for implementing a method 1000 for training a deep learning model using satellite imagery along with misaligned ground truth data, in accordance with an embodiment of the disclosure.
[0097] At block 1002, the method 1000 may include receiving a plurality of aerial images and a plurality of ground truth-masked images of the building footprints.
[0098] At block 1004, the method 1000 may include determining a plurality of ground truth polygon masks in at least one ground truth-masked image and a plurality of boundary polygon masks in the at least one aerial image using an image processing technique. Further, the method 1000 may include correlating the at least one aerial image and the at least one ground truth-masked image, predicting a binary segmentation mask of the at least one aerial image based on the correlation, and determining the plurality of boundary polygon masks in the at least one predicted binary segmentation mask using the image processing technique based on the prediction.
[0099] At block 1006, the method 1000 may include determining that at least one ground truth polygon mask is closely matched with each boundary polygon mask. Further, the method 1000 may include comparing each ground truth polygon mask with each boundary polygon mask to determine the distance transformation between each ground truth polygon mask and each boundary polygon mask.
[00100] At block 1008, the method 1000 may include comparing the at least one closely matched ground truth polygon mask with each boundary polygon mask.
[00101] At block 1010, the method 1000 may include determining a standard loss based on the comparison. Further, the method 1000 may include selecting the lowest error of the at least one closely matched ground truth polygon mask corresponding to said each boundary polygon mask.
[00102] At block 1012, the method 1000 may include training the deep learning model for semantic segmentation of the building footprints based on the standard loss. Further, the method 1000 may include back propagating the determined standard loss to the deep learning model.
[00103] As may be appreciated, bringing the power of data and internet connectivity to rural areas and to the lowest economic strata across geographies requires an enormous amount of careful network planning and estimation. Geospatial analytics, visualisation, and applications powered by rich geospatial data enrichment power these estimation applications. To sustain a rapid pace of business expansion, the disclosed system and method enable the use of already available data to develop automated feature extraction from satellite data.
[00104] Therefore, the disclosed system and method attempt to learn a robust extraction model given noisy ground truth labels by utilizing the information symmetry present in the ground truth, i.e., the shapes of the buildings and their relationships with other buildings.
[00105] FIG. 11 illustrates an exemplary computer system 1100 in which or with which embodiments of the present disclosure may be implemented. As shown in FIG. 11, the computer system 1100 may include an external storage device 1110, a bus 1120, a main memory 1130, a read-only memory 1140, a mass storage device 1150, communication port(s) 1160, and a processor 1170. A person skilled in the art will appreciate that the computer system 1100 may include more than one processor and communication ports. The processor 1170 may include various modules associated with embodiments of the present disclosure. The communication port(s) 1160 may be any of an RS-232 port for use with a modem-based dialup connection, a 10/100 Ethernet port, a Gigabit or 10 Gigabit port using copper or fiber, a serial port, a parallel port, or other existing or future ports. The communication port(s) 1160 may be chosen depending on a network, such as a Local Area Network (LAN), a Wide Area Network (WAN), or any network to which the computer system 1100 connects. The main memory 1130 may be a Random Access Memory (RAM), or any other dynamic storage device commonly known in the art. The read-only memory 1140 may be any static storage device(s), e.g., but not limited to, Programmable Read Only Memory (PROM) chips for storing static information, e.g., start-up or BIOS instructions for the processor 1170. The mass storage device 1150 may be any current or future mass storage solution, which can be used to store information and/or instructions. Exemplary mass storage devices 1150 include, but are not limited to, Parallel Advanced Technology Attachment (PATA) or Serial Advanced Technology Attachment (SATA) hard disk drives or solid-state drives (internal or external, e.g., having Universal Serial Bus (USB) and/or Firewire interfaces), one or more optical discs, and Redundant Array of Independent Disks (RAID) storage, e.g., an array of disks.
[00106] The bus 1120 communicatively couples the processor 1170 with the other memory, storage, and communication blocks. The bus 1120 may be, e.g., a Peripheral Component Interconnect (PCI)/PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), Universal Serial Bus (USB), or the like, for connecting expansion cards, drives, and other subsystems, as well as other buses, such as a front side bus (FSB), which connects the processor 1170 to the computer system 1100.
[00107] Optionally, operator and administrative interfaces, e.g., a display, keyboard, joystick and a cursor control device, may also be coupled to the bus 1120 to support direct operator interaction with the computer system 1100. Other operator and administrative interfaces can be provided through network connections connected through the communication port(s) 1160. Components described above are meant only to exemplify various possibilities. In no way should the aforementioned exemplary computer system limit the scope of the present disclosure.
[00108] While the foregoing describes various embodiments of the disclosure, other and further embodiments of the disclosure may be devised without departing from the basic scope thereof. The scope of the disclosure is determined by the claims that shall follow. The disclosure is not limited to the described embodiments, versions or examples, which are included to enable a person having ordinary skill in the art to make and use the disclosure when combined with information and knowledge available to the person having ordinary skill in the art.
ADVANTAGES OF THE PRESENT DISCLOSURE
[00109] The present disclosure provides a system and a method that provide a shift-aware loss metric.
[00110] The present disclosure provides a system and a method that facilitate the development of extremely robust and highly accurate models.
[00111] The present disclosure provides a system and a method that facilitate saving significant human annotation correction effort by directly utilising misaligned ground truth data.
CLAIMS:
1. A system (208) for training a deep learning model for semantic segmentation of building footprints, comprising:
one or more processors (210); and
a memory (212) operatively coupled to the one or more processors (210), wherein the memory (212) comprises processor-executable instructions, which on execution, cause the one or more processors (210) to:
receive a plurality of aerial images and a plurality of ground truth-masked images of the building footprints;
determine a plurality of ground truth polygon masks in at least one ground truth-masked image and a plurality of boundary polygon masks in at least one aerial image using an image processing technique;
determine that at least one ground truth polygon mask is closely matched with each boundary polygon mask based on the determination;
compare the at least one closely matched ground truth polygon mask with said each boundary polygon mask;
determine a standard loss based on the comparison; and
train the deep learning model for semantic segmentation of the building footprints based on the standard loss.
2. The system (208) as claimed in claim 1, wherein the one or more processors (210) are to:
correlate the at least one aerial image and the at least one ground truth-masked image;
predict a binary segmentation mask of the at least one aerial image based on the correlation; and
determine the plurality of boundary polygon masks in the at least one predicted binary segmentation mask of the at least one aerial image using the image processing technique based on the prediction.
3. The system (208) as claimed in claim 1, wherein the one or more processors (210) are to:
compare said at least one ground truth polygon mask with said each boundary polygon mask; and
determine a distance transformation between said at least one ground truth polygon mask and said each boundary polygon mask based on the comparison.
4. The system (208) as claimed in claim 3, wherein the distance transformation replaces a value of each pixel in said each boundary polygon mask with a distance to a nearest boundary pixel within said at least one ground truth polygon mask.
5. The system (208) as claimed in claim 3, wherein the one or more processors (210) are to:
determine an error of said each boundary polygon mask based on the distance transformation;
compare the error of said each boundary polygon mask with said at least one ground truth polygon mask using the distance transformation;
determine a lowest error based on the comparison;
determine the at least one closely matched ground truth polygon mask based on the lowest error; and
assign the at least one closely matched ground truth polygon mask to said each boundary polygon mask.
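A non-limiting sketch of the lowest-error matching and assignment of claim 5, reusing boundary_distance_error() from the previous sketch; the brute-force search over all ground truth polygons is an illustrative assumption:

```python
import numpy as np

def assign_closest_ground_truth(pred_masks, gt_masks):
    """For each predicted boundary polygon, pick the ground truth polygon
    with the lowest distance-transform error and assign it."""
    assignments = []
    for pred in pred_masks:
        errors = [boundary_distance_error(pred, gt) for gt in gt_masks]
        best = int(np.argmin(errors))
        assignments.append((pred, gt_masks[best], errors[best]))
    return assignments
```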
6. The system (208) as claimed in claim 5, wherein the one or more processors (210) are to:
determine an overlap range between said each boundary polygon mask and said at least one ground truth polygon mask to determine the error.
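One plausible, non-limiting reading of the "overlap range" of claim 6 is an intersection-over-union score between the two filled masks; the following sketch assumes that reading:

```python
import numpy as np

def overlap_range(pred_mask, gt_mask):
    """Intersection-over-union between two filled polygon masks."""
    pred, gt = pred_mask > 0, gt_mask > 0
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 0.0
```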
7. The system (208) as claimed in claim 5, wherein the one or more processors (210) are to:
apply a transformation matrix between said each boundary polygon mask and the at least one closely matched ground truth polygon mask; and
determine the lowest error of the at least one closely matched ground truth polygon mask corresponding to said each boundary polygon mask.
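A non-limiting sketch of the transformation matrix step of claim 7, assuming pure translations as the searched family of transformations (the claim does not fix the family) and reusing boundary_distance_error() from the earlier sketch; the search bounds and step size are illustrative:

```python
import cv2
import numpy as np

def lowest_error_under_shift(pred_mask, gt_mask, max_shift=16, step=4):
    """Apply candidate 2x3 translation matrices to the matched ground
    truth mask and return the lowest boundary error found."""
    h, w = gt_mask.shape
    best = float("inf")
    for dx in range(-max_shift, max_shift + 1, step):
        for dy in range(-max_shift, max_shift + 1, step):
            matrix = np.float32([[1, 0, dx], [0, 1, dy]])
            shifted = cv2.warpAffine(gt_mask, matrix, (w, h))
            best = min(best, boundary_distance_error(pred_mask, shifted))
    return best
```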
8. The system (208) as claimed in claim 7, wherein the one or more processors (210) are to select the lowest error of the at least one closely matched ground truth polygon mask corresponding to said each boundary polygon mask to determine the standard loss.
9. The system (208) as claimed in claim 1, wherein the one or more processors (210) are to back propagate the determined standard loss to the deep learning model for training.
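A non-limiting sketch of the back propagation of claim 9, assuming PyTorch and binary cross-entropy as the standard loss computed against a shift-corrected ground truth mask; the names and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, aerial_image, aligned_gt_mask):
    """One training step: compute the standard loss against the
    shift-corrected ground truth and back propagate it into the model.

    aligned_gt_mask: float tensor of shape (H, W) with values in {0, 1}.
    """
    optimizer.zero_grad()
    logits = model(aerial_image)                       # (1, 1, H, W)
    target = aligned_gt_mask.unsqueeze(0).unsqueeze(0)
    loss = F.binary_cross_entropy_with_logits(logits, target)
    loss.backward()    # back propagation of the determined standard loss
    optimizer.step()
    return loss.item()
```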
10. A method (1000) for training a deep learning model for semantic segmentation of building footprints, comprising:
receiving (1002), by one or more processors (210), a plurality of aerial images and a plurality of ground truth-masked images of the building footprints;
determining (1004), by the one or more processors (210), a plurality of ground truth polygon masks in at least one ground truth-masked image and a plurality of boundary polygon masks in at least one aerial image using an image processing technique;
determining (1006), by the one or more processors (210), that at least one ground truth polygon mask is closely matched with each boundary polygon mask based on the determination;
comparing (1008), by the one or more processors (210), the at least one closely matched ground truth polygon mask with said each boundary polygon mask;
determining (1010), by the one or more processors (210), a standard loss based on the comparison; and
training (1012), by the one or more processors (210), the deep learning model for semantic segmentation of the building footprints based on the standard loss.
11. The method (1000) as claimed in claim 10, wherein determining (1004), by the one or more processors (210), the plurality of boundary polygon masks in the at least one aerial image comprises:
correlating, by the one or more processors (210), the at least one aerial image and the at least one ground truth-masked image;
predicting, by the one or more processors (210), a binary segmentation mask of the at least one aerial image based on the correlation; and
determining, by the one or more processors (210), the plurality of boundary polygon masks in the at least one predicted binary segmentation mask of the at least one aerial image using the image processing technique based on the prediction.
12. The method (1000) as claimed in claim 10, wherein determining (1006), by the one or more processors (210), that the at least one ground truth polygon mask is closely matched with said each boundary polygon mask based on the determination comprises:
comparing, by the one or more processors (210), said at least one ground truth polygon mask with said each boundary polygon mask; and
determining, by the one or more processors (210), a distance transformation between said at least one ground truth polygon mask and said each boundary polygon mask based on the comparison.
13. The method (1000) as claimed in claim 12, wherein the distance transformation replaces a value of each pixel in said each boundary polygon mask with a distance to a nearest boundary pixel within said at least one ground truth polygon mask.
14. The method (1000) as claimed in claim 12, comprising:
determining, by the one or more processors (210), an error of said each boundary polygon mask based on the distance transformation;
comparing, by the one or more processors (210), the error of said each boundary polygon mask with said at least one ground truth polygon mask using the distance transformation;
determining, by the one or more processors (210), a lowest error based on the comparison;
determining, by the one or more processors (210), the at least one closely matched ground truth polygon mask based on the lowest error; and
assigning, by the one or more processors (210), the at least one closely matched ground truth polygon mask to said each boundary polygon mask.
15. The method (1000) as claimed in claim 14, wherein determining, by the one or more processors (210), the error of said each boundary polygon mask comprises:
determining, by the one or more processors (210), an overlap range between said each boundary polygon mask and said at least one ground truth polygon mask to determine the error.
16. The method (1000) as claimed in claim 14, wherein determining, by the one or more processors (210), the lowest error comprises:
applying, by the one or more processors (210), a transformation matrix between said each boundary polygon mask and the at least one closely matched ground truth polygon mask; and
determining, by the one or more processors (210), the lowest error of the at least one closely matched ground truth polygon mask corresponding to said each boundary polygon mask.
17. The method (1000) as claimed in claim 16, wherein determining (1010), by the one or more processors (210), the standard loss based on the comparison comprises:
selecting, by the one or more processors (210), the lowest error of the at least one closely matched ground truth polygon mask corresponding to said each boundary polygon mask.
18. The method (1000) as claimed in claim 10, wherein training (1012), by the one or more processors (210), the deep learning model for semantic segmentation of the building footprints based on the standard loss comprises:
back propagating, by the one or more processors (210), the determined standard loss to the deep learning model.