Abstract: This disclosure provides systems and methods for indoor layout estimation using self-attention and adversarial learning. A self-attention and adversarial learning based architecture is used in the present disclosure to generate high quality occupancy maps from an RGB image for indoor scenes in real time. The self-attention and adversarial learning based architecture is dynamically updated during training using an optimized loss function which is a combination of weighted cross entropy loss, boundary loss, and GAN loss. The generated occupancy maps are used to estimate the layout of the indoor environment. Such occupancy maps are often crucial for path-planning and mapping in indoor environments but are often built using only information contained in the ego view. In the method of the present disclosure, occupancy values are predicted beyond immediately visible regions from just a monocular image, leveraging learnt priors from indoor scenes.
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
SYSTEMS AND METHODS FOR INDOOR LAYOUT ESTIMATION USING SELF-ATTENTION AND ADVERSARIAL LEARNING
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
[001] The present application claims priority from Indian provisional patent application no. 202221059890, filed on October 20, 2022. The entire contents of the aforementioned application are incorporated herein by reference.
TECHNICAL FIELD
The disclosure herein generally relates to the field of indoor layout estimation, and, more particularly, to systems and methods for indoor layout estimation using self-attention and adversarial learning.
BACKGROUND
Indoor layout estimation refers to estimating the occupancy map of a scene, i.e., its navigable and non-navigable areas. In recent years, this field has gained traction, motivated primarily by its applications in several robotics tasks, such as SLAM, exploration, and indoor navigation. Layouts that are typically used for such tasks are limited to information contained in an ego-view and do not consider priors that humans can easily apply to a scene. In contrast, humans can predict an extended or amodal layout of a scene and guess the presence of free space or obstacles behind occluding surfaces like furniture. While there has been sufficient traction for amodal layout estimation for on-road scenes in the context of autonomous driving, there have been very limited efforts for indoor scenes like offices and home spaces. Further, this task is far from trivial as indoor layouts are arguably more complex and diverse in nature, compared to typical outdoor layouts, where the shape of a vehicle and the surrounding environment are largely consistent. Furthermore, in indoor scenes, the layout of a single room itself can change drastically depending on the viewing angle, the obstructions present, and the position of the robot. While a few conventional learning based approaches exist for indoor layouts, they use depth sensors at inference time, which incurs an additional expense.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a processor implemented method is provided. The processor implemented method comprising acquiring in real time, via one or more processors, an image associated with one or more views of one or more scenes in an indoor environment using an image capturing device mounted on a robotic device; training, via the one or more hardware processors, a self-attention and adversarial learning driven network, wherein the self-attention and adversarial learning driven network comprising: an encoder-decoder sub-network configured to receive the image associated with the one or more views of the one or more scenes in the indoor environment as an input, wherein the encoder-decoder sub-network comprises an encoder, a self-attention module, and a decoder; and a patch-based discriminator sub-network in communication with the encoder-decoder sub-network; generating, via the one or more hardware processors, a plurality of occupancy maps of the indoor environment using the trained self-attention and adversarial learning driven network; dynamically updating, via the one or more hardware processors, one or more parameters of the trained self-attention and adversarial learning driven network based on an optimized loss function, wherein the optimized loss function is obtained based on a first loss type, a second loss type and a third loss type; generating, via the one or more hardware processors, a plurality of updated occupancy maps of the indoor environment using the dynamically updated trained self-attention and adversarial learning driven network; estimating, via the one or more hardware processors, a layout of the indoor environment based on the plurality of updated occupancy maps; and enabling, via the one or more hardware processors, the patch-based discriminator sub-network to quantify correctness of estimated layout of the indoor environment with respect to a ground truth using one or more metrics.
In another aspect, a system is provided. The system comprising a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to acquire in real time, an image associated with one or more views of one or more scenes in an indoor environment using an image capturing device mounted on a robotic device; train, a self-attention and adversarial learning driven network, wherein the self-attention and adversarial learning driven network comprising: an encoder-decoder sub-network configured to receive the image associated with the one or more views of the one or more scenes in the indoor environment as an input, wherein the encoder-decoder sub-network comprises an encoder, a self-attention module, and a decoder; and a patch-based discriminator sub-network in communication with the encoder-decoder sub-network; generate, a plurality of occupancy maps of the indoor environment using the trained self-attention and adversarial learning driven network; dynamically update, one or more parameters of the trained self-attention and adversarial learning driven network based on an optimized loss function, wherein the optimized loss function is obtained based on a first loss type, a second loss type and a third loss type; generate, a plurality of updated occupancy maps of the indoor environment using the dynamically updated trained self-attention and adversarial learning driven network; estimate, a layout of the indoor environment based on the plurality of updated occupancy maps; and enable, the patch-based discriminator sub-network to quantify correctness of estimated layout of the indoor environment with respect to a ground truth using one or more metrics.
In yet another aspect, a non-transitory computer readable medium is provided. The non-transitory computer readable medium is configured by instructions for acquiring in real time, an image associated with one or more views of one or more scenes in an indoor environment using an image capturing device mounted on a robotic device; training, a self-attention and adversarial learning driven network, wherein the self-attention and adversarial learning driven network comprising: an encoder-decoder sub-network configured to receive the image associated with the one or more views of the one or more scenes in the indoor environment as an input, wherein the encoder-decoder sub-network comprises an encoder, a self-attention module, and a decoder; and a patch-based discriminator sub-network in communication with the encoder-decoder sub-network; generating, a plurality of occupancy maps of the indoor environment using the trained self-attention and adversarial learning driven network; dynamically updating, one or more parameters of the trained self-attention and adversarial learning driven network based on an optimized loss function, wherein the optimized loss function is obtained based on a first loss type, a second loss type and a third loss type; generating, a plurality of updated occupancy maps of the indoor environment using the dynamically updated trained self-attention and adversarial learning driven network; estimating, a layout of the indoor environment based on the plurality of updated occupancy maps; and enabling, the patch-based discriminator sub-network to quantify correctness of estimated layout of the indoor environment with respect to a ground truth using one or more metrics.
In accordance with an embodiment of the present disclosure, the encoder comprised in the encoder-decoder sub-network of the self-attention and adversarial learning driven network is a convolution-based encoder and configured to extract a plurality of features from the image associated with the one or more views of the one or more scenes in the indoor environment.
In accordance with an embodiment of the present disclosure, the self-attention module comprised in the encoder-decoder sub-network of the self-attention and adversarial learning driven network uses a single transformer block with multi-head attention and configured to project and aggregate the plurality of features across different patches in a context dependent manner.
In accordance with an embodiment of the present disclosure, the decoder comprised in the encoder-decoder sub-network of the self-attention and adversarial learning driven network is configured to classify the generated plurality of occupancy maps into one of: (a) an unknown category, (b) an occupied category, and (c) a navigable category.
In accordance with an embodiment of the present disclosure, the first loss type is weighted cross entropy loss, second loss type is boundary loss, and the third loss type is generative adversarial network (GAN) loss.
In accordance with an embodiment of the present disclosure, the estimated layout of the indoor environment is a top down bird view layout.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates an exemplary system for indoor layout estimation using self-attention and adversarial learning according to some embodiments of the present disclosure.
FIG. 2 illustrates an exemplary flow diagram illustrating a method for indoor layout estimation using self-attention and adversarial learning in accordance with some embodiments of the present disclosure.
FIG. 3 is a functional block diagram of an architecture of a self-attention and adversarial learning driven network for indoor layout estimation according to some embodiments of the present disclosure.
FIG. 4 depicts qualitative results providing a comparison of the method of the present disclosure with state of the art approaches for qualitative evaluation according to some embodiments of the present disclosure.
FIG. 5 depicts a visualization of saliency maps from various layers of the system of the present disclosure for indoor layout estimation using self-attention and adversarial learning according to some embodiments of the present disclosure.
FIG. 6 depicts registering a map for an entire scene using the generated plurality of occupancy maps to evaluate utility of the method of the present disclosure in mapping tasks according to some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
Humans have a remarkable ability to navigate through new indoor spaces based on knowledge acquired by traversing similar scenes in the past. As a result, one can easily infer multiple properties about a given location by leveraging these priors, such as the semantic configuration of a room, proximity to adjacent spaces (e.g., a kitchen may occur near the dining room), and the layout of the scene. Layout is a catchall phrase that simultaneously refers to floorplans, Manhattan-world 3D room layouts, and occupancy maps. One conventional approach on floorplans uses a 3D point cloud as input to produce a polygonized floorplan of the indoor scene. Another conventional approach uses floorplans as a prior along with Red-Green-Blue (RGB) images to predict the 3D room layout of a scene by exploiting the geometry of the scene. However, these representations often fail to capture the presence of obstacles in the scene, which is essential for downstream tasks such as robot navigation. Further, they often require the use of floorplans or 3D scans of the scene at inference, which are not easy to obtain.
The present disclosure focusses on occupancy map prediction from a monocular RGB image, which is easier to use on real robots. Occupancy maps have been extensively used in robotics, particularly for mapping, navigation, and planning. A few conventional approaches for mapping used Light Detection and Ranging (LiDAR) and sensor fusion to build occupancy maps, but recently, deep learning approaches have shown great success using just RGB images, particularly in outdoor scenes. Learning based approaches for indoor layouts are, however, relatively new and are often used as a proxy for other tasks such as point navigation and object navigation. A few learning based approaches on these tasks use depth sensors at inference time, which incurs an additional expense. In contrast, the present disclosure focusses on improving layout predictions using just RGB images and surpasses relevant current state of the art approaches, while also showing improvements on point navigation.
Amodal Layout Prediction: Occupancy maps generated from single views are often incomplete, lacking any information beyond what is immediately visible in the RGB image. Humans, however, can hallucinate beyond this and use prior information to reason about the occluded areas as well, predicting the amodal layout. Conventionally, learning based approaches can reasonably predict the presence of cars and roads beyond visible regions using just RGB images in outdoor scenes. In indoor environments, amodal layout estimation using conventional approaches includes prediction of occluded regions and, in some cases, semantic classes. Further, occupancy for only a few selected semantic classes is hallucinated in the conventional approaches. In contrast, the method of the present disclosure does not discriminate between object classes and predicts occupancy for all objects. In one existing approach for amodal layout estimation in indoor environments, a network is trained to recreate the visible layout seen from a higher vantage point using an RGB image and the visible layout from a lower height, obtained using a depth sensor. In contrast, the present disclosure uses only a monocular image and predicts a true bird's eye view of the layout of the indoor environment.
Transformers for Image Synthesis: Generating bird's eye view images from a monocular camera is fundamentally ill-posed as RGB images lack concrete information about the depth of the scene. Hence, the present disclosure is more aligned with problems in the image translation domain, where a new image is generated given a guiding image as input. Some approaches use attention to guide a generative network that translates one view to another, given a semantic map of the target view as guidance. However, the present disclosure proposes a self-attention and adversarial learning driven network to directly predict the bird's eye view of the layout of an indoor environment using just a monocular image and without using any secondary image for guidance.
Embodiments of the present disclosure provide systems and methods for indoor layout estimation using self-attention and adversarial learning. A learning based approach, named IndoLayout, is proposed in the present disclosure to generate high quality occupancy maps from an RGB image for indoor scenes in real time. These occupancy maps are often crucial for path-planning and mapping in indoor environments but are often built using only information contained in the ego view. In contrast, the method of the present disclosure also predicts occupancy values beyond immediately visible regions from just a monocular image, leveraging learnt priors from indoor scenes. Hence, the system of the present disclosure can produce a hallucinated, amodal layout of an indoor scene that includes areas occluded in the RGB image, such as a navigable floor behind a desk. A novel architecture that uses self-attention and adversarial learning is used in the present disclosure to vastly improve the quality of the predicted layout. More specifically, the present disclosure describes the following:
Implementation of a light-weight architecture that beats the existing state-of-the-art on amodal occupancy map estimation by effectively leveraging attention and adversarial learning.
Significant improvements over state-of-the-art methods on three large, challenging indoor datasets, namely Gibson, Matterport3D, and HM3D, are demonstrated.
The importance of analyzing layout quality is demonstrated by adopting two new metrics for indoor layout estimation, on which prior work is also surpassed.
Application of IndoLayout to the point navigation task is shown, with superior results over comparable methods.
Referring now to the drawings, and more particularly to FIGS. 1 through 6, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates an exemplary system 100 for indoor layout estimation using self-attention and adversarial learning according to some embodiments of the present disclosure. In an embodiment, the system 100 includes one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106 (also referred as interface(s)), and one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104. The one or more processors 104 may be one or more software processing components and/or hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is/are configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks (N/W) and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 108 is comprised in the memory 102, wherein the database 108 comprises received image corresponding to the one or more views of the one or more scenes in the indoor environment. The database 108 further stores information on the scene in the indoor environment, occupancy maps, ground truth, and estimated layout of the indoor environment.
The database 108 further comprises one or more networks such as one or more neural network(s), encoder-decoder sub-network, patch-based discriminator sub-network, and/or the like which when invoked and executed perform corresponding steps/actions as per the requirement by the system 100 to perform the methodologies described herein. The memory 102 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure. In other words, input(s) fed at each step and output(s) generated at each step are comprised in the memory 102 and can be utilized in further processing and analysis.
FIG. 2, with reference to FIG. 1, depicts an exemplary flow chart illustrating a method 200 for indoor layout estimation using self-attention and adversarial learning, using the system 100 of FIG. 1, in accordance with an embodiment of the present disclosure.
Referring to FIG. 2, in an embodiment, the system(s) 100 comprises one or more data storage devices or the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104. The steps of the method 200 of the present disclosure will now be explained with reference to components of the system 100 of FIG. 1, the flow diagram as depicted in FIG. 2, and the block diagram of FIG. 3. In an embodiment, at step 202 of the present disclosure, the one or more hardware processors 104 are configured to acquire in real time, an image corresponding to one or more views of one or more scenes in an indoor environment using an image capturing device (e.g., a monocular camera) mounted on a robotic device (e.g., robot, an agent, unmanned aerial vehicle (UAV), and the like), in the form of a monocular Red, Green, Blue (RGB) image.
In an embodiment, at step 204 of the present disclosure, the one or more hardware processors 104 are configured to train a self-attention and adversarial learning driven network. FIG. 3, with reference to FIGS. 1-2, depicts a functional block diagram of an architecture of the self-attention and adversarial learning driven network as implemented by the system 100 of FIG. 1 for indoor layout estimation, in accordance with an embodiment of the present disclosure. The self-attention and adversarial learning driven network comprises an encoder-decoder sub-network configured to receive the image corresponding to the one or more views of the one or more scenes in the indoor environment as an input and a patch-based discriminator sub-network in communication with the encoder-decoder sub-network. The encoder-decoder sub-network comprises an encoder, a self-attention module, and a decoder. In an embodiment, the encoder comprised in the encoder-decoder sub-network of the self-attention and adversarial learning driven network is a convolution-based encoder and configured to extract a plurality of features from the image corresponding to the one or more views of the one or more scenes in the indoor environment. As depicted in the block diagram of FIG. 3, the encoder receives the image corresponding to the one or more views of the one or more scenes in the indoor environment as an input. In an embodiment, the first 4 blocks of a residual network (ResNet-18) pretrained on a known dataset (i.e., ImageNet) are used as the encoder, followed by convolution and max-pooling layers to reduce an input map from a resolution of 3 × 512 × 512 to 128 × 8 × 8. Instead of adopting a transformer-based encoder to generate patch embeddings, a convolution-based encoder is used in the present disclosure since convolutional encoders are more efficient for finetuning with small datasets due to their inductive biases. They also reduce computation overhead by reducing the number of patches for the subsequent self-attention module, which allows for faster training and inference times. The encoder extracts a plurality of features from the received image.
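For illustration, the following PyTorch sketch (the framework is not mandated by the present disclosure) shows one way such a convolution-based encoder could be assembled: the stem and first four residual stages of an ImageNet-pretrained ResNet-18, followed by a channel-reducing convolution and max-pooling that bring a 3 × 512 × 512 input down to a 128 × 8 × 8 patch-feature grid. The exact reducing layers (kernel size, activation) are assumptions made only to keep the sketch concrete.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class LayoutEncoder(nn.Module):
    """Convolution-based encoder: stem and first four residual stages of an
    ImageNet-pretrained ResNet-18, followed by a channel-reducing convolution
    and max-pooling, mapping a 3 x 512 x 512 image to a 128 x 8 x 8 feature grid."""

    def __init__(self, pretrained: bool = True):
        super().__init__()
        weights = models.ResNet18_Weights.DEFAULT if pretrained else None
        resnet = models.resnet18(weights=weights)
        # ResNet-18 stem + layer1..layer4 yield a 512 x 16 x 16 map for a 512 x 512 input.
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4,
        )
        # Reduce channels and halve the resolution to obtain the 128 x 8 x 8 patch embeddings.
        self.reduce = nn.Sequential(
            nn.Conv2d(512, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.reduce(self.backbone(x))  # (B, 128, 8, 8)
```

For example, `LayoutEncoder()(torch.randn(1, 3, 512, 512))` would return a tensor of shape (1, 128, 8, 8), matching the patch grid described above.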
In an embodiment, the self-attention module comprised in the encoder-decoder sub-network of the self-attention and adversarial learning driven network uses a single transformer block with multi-head attention and is configured to project and aggregate the plurality of features across different patches in a context dependent manner. The plurality of features extracted by the encoder carry the spatial nature of the perspective view and need to be transformed into a space more relevant to the top view. The self-attention module is used to project and aggregate the plurality of features across different patches in a context dependent manner. As shown in FIG. 3, a single transformer block with multi-head attention is used as the self-attention module in the present disclosure, as empirical results with a larger number of blocks gave only marginal improvements despite higher training/inference costs. Correlation values are calculated from the plurality of extracted features by the self-attention module to assist in mapping free and occupied space in the received image. Further, a learned positional embedding is added to the plurality of extracted features to provide additional context for feature aggregation in the self-attention module.
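A minimal sketch of such a single transformer block operating on the 64 patch tokens is shown below; the number of attention heads, the feed-forward width, and the pre-norm arrangement are illustrative assumptions rather than parameters fixed by the present disclosure.

```python
import torch
import torch.nn as nn


class PatchSelfAttention(nn.Module):
    """Single transformer block with multi-head self-attention over the 8 x 8 grid
    of 128-d patch features, with a learned positional embedding added beforehand."""

    def __init__(self, dim: int = 128, num_patches: int = 64, num_heads: int = 4):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2) + self.pos_embed   # (B, 64, 128) patch tokens
        q = self.norm1(tokens)
        attn_out, _ = self.attn(q, q, q)                             # context-dependent aggregation
        tokens = tokens + attn_out                                   # residual connection
        tokens = tokens + self.mlp(self.norm2(tokens))               # feed-forward sub-block
        return tokens.transpose(1, 2).reshape(b, c, h, w)            # back to (B, 128, 8, 8)
```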
In an embodiment, at step 206 of the present disclosure, the one or more hardware processors 104 are configured to generate a plurality of occupancy maps of the indoor environment using the trained self-attention and adversarial learning driven network. The decoder receives the projected and aggregated plurality of features that are computed by the single transformer block of the self-attention module and iteratively applies a series of convolutions, followed by BatchNorm, rectified linear activation function (ReLU) activation, and upsampling layers to produce a final output of shape 3 × 128 × 128, where each channel corresponds to a probability of being unexplored, occupied, or free respectively, after applying a Softmax function. In an embodiment, the decoder comprised in the encoder-decoder sub-network of the self-attention and adversarial learning driven network is configured to classify the generated plurality of occupancy maps into one of: (a) an unknown category, (b) an occupied category, and (c) a navigable category.
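The decoder stage could, for example, be realized along the lines of the sketch below: repeated Conv–BatchNorm–ReLU–Upsample blocks that expand the 128 × 8 × 8 attended features to a 3 × 128 × 128 probability map. The intermediate channel widths are assumptions chosen only to make the sketch concrete.

```python
import torch
import torch.nn as nn


class LayoutDecoder(nn.Module):
    """Upsampling decoder: repeated Conv -> BatchNorm -> ReLU -> Upsample stages that
    grow 128 x 8 x 8 attended features into a 3 x 128 x 128 map whose channels, after
    a softmax, give per-cell probabilities for the unknown/occupied/navigable classes."""

    def __init__(self, in_ch: int = 128, num_classes: int = 3):
        super().__init__()
        chans = [in_ch, 64, 32, 16, 16]                # 8 -> 16 -> 32 -> 64 -> 128 resolution
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            ]
        self.blocks = nn.Sequential(*blocks)
        self.head = nn.Conv2d(chans[-1], num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.head(self.blocks(x))             # (B, 3, 128, 128)
        return torch.softmax(logits, dim=1)            # per-cell class probabilities
```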
Referring to steps of FIG. 2 of the present disclosure, at step 208, the one or more hardware processors 104 are configured to dynamically update one or more parameters of the trained self-attention and adversarial learning driven network based on an optimized loss function. The one or more parameters are the parameters of the neural networks used for training, which may include, but are not limited to, the convolution layers in the encoder-decoder blocks and the self-attention module, the feedforward layer of the transformer block, and the convolution layers in the patch-based discriminator. Each of the convolutional layers, attention blocks, and discriminator layers refers to a matrix of weights that gets modified in each iteration of training. The three losses in combination help update these weight values in each of the layers via back propagation. The optimized loss function is obtained based on a first loss type, a second loss type and a third loss type. In an embodiment, the first loss type is weighted cross entropy loss, the second loss type is boundary loss, and the third loss type is generative adversarial network (GAN) loss. The optimized loss function is a combination of the weighted cross entropy loss, boundary loss, and GAN loss.
The weighted cross entropy is computed over the three output categories (i.e., the unknown category, the occupied category, and the navigable category) using ground truth supervision. The boundary loss, which particularly penalizes misclassification around the boundary of objects, is calculated by applying an L1 loss between a ground truth boundary and the spatial gradient of the output, as provided in equation (1) below:
L_{bdry} = \left| \left\| \nabla \hat{y} \right\|_2 - y_{bdry} \right|; \quad L_{CE} = -\sum_{j=1}^{3} y_j \log(\hat{y}_j) \qquad (1)
Here, $y_{bdry}$ refers to the contours/boundary of the ground truth layouts and $\hat{y}$ refers to an estimated layout. The spatial gradient $\nabla \hat{y}$ computes gradients in the x and y directions separately, which are then combined by calculating their norm. The GAN loss trains the patch-based discriminator sub-network and provides additional supervision to the decoder in accordance with equations (2) and (3) provided below:
L_{discr} = \mathbb{E}_{y \sim P_{true}}\left[D(y)\right] + \mathbb{E}_{\hat{y} \sim P_{fake}}\left[D(\hat{y})\right] \qquad (2)
L_{gen} = \mathbb{E}_{x \sim P_{true}}\left[D(\hat{y})\right] \qquad (3)
Here, $\hat{y}$ is the estimated layout, corresponding to the fake distribution, and $y$ is the ground truth layout, corresponding to the true distribution. The final optimized loss function can be expressed as shown in equation (4) below:
L = L_{CE} + \lambda_{bdry} L_{bdry} + \lambda_{gen} L_{gen} \qquad (4)
where $\lambda_{gen}$ and $\lambda_{bdry}$ are the weights assigned to the corresponding losses $L_{gen}$ and $L_{bdry}$. It is found that $\lambda_{bdry} = 0.001$ and $\lambda_{gen} = 0.01$ give the best results.
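For illustration only, a plausible realization of the combined loss of equations (1)-(4) is sketched below. Two simplifications are assumed: the ground-truth boundary map $y_{bdry}$ is approximated here by the spatial gradient of the one-hot labels, and the generator term uses a standard non-saturating binary cross entropy formulation; neither detail is prescribed by the present disclosure.

```python
import torch
import torch.nn.functional as F


def indolayout_loss(pred, target, class_weights, disc_fake_score,
                    lambda_bdry: float = 0.001, lambda_gen: float = 0.01):
    """Combined training loss: weighted cross entropy + boundary loss + GAN (generator) loss.

    pred            : (B, 3, H, W) predicted class probabilities (after softmax)
    target          : (B, H, W) ground-truth labels in {0: unknown, 1: occupied, 2: free}
    class_weights   : (3,) per-class weights for the cross entropy term
    disc_fake_score : discriminator output (logits) on the predicted layout
    """
    # Weighted cross entropy over the three categories (eq. 1, L_CE).
    l_ce = F.nll_loss(torch.log(pred.clamp_min(1e-8)), target, weight=class_weights)

    # Boundary loss: L1 between the norm of the spatial gradient of the prediction
    # and a boundary map derived from the ground truth (eq. 1, L_bdry; approximation).
    onehot = F.one_hot(target, num_classes=3).permute(0, 3, 1, 2).float()
    gx_p = pred[..., :, 1:] - pred[..., :, :-1]          # horizontal gradient of prediction
    gy_p = pred[..., 1:, :] - pred[..., :-1, :]          # vertical gradient of prediction
    gx_t = onehot[..., :, 1:] - onehot[..., :, :-1]
    gy_t = onehot[..., 1:, :] - onehot[..., :-1, :]
    grad_p = torch.sqrt(gx_p[..., 1:, :] ** 2 + gy_p[..., :, 1:] ** 2 + 1e-8)
    grad_t = torch.sqrt(gx_t[..., 1:, :] ** 2 + gy_t[..., :, 1:] ** 2 + 1e-8)
    l_bdry = F.l1_loss(grad_p, grad_t)

    # Generator-side GAN loss: push the discriminator to score the predicted
    # layout as real (eq. 3; non-saturating BCE formulation assumed).
    l_gen = F.binary_cross_entropy_with_logits(
        disc_fake_score, torch.ones_like(disc_fake_score))

    return l_ce + lambda_bdry * l_bdry + lambda_gen * l_gen    # eq. 4
```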
Further, at step 210 of the present disclosure, the one or more hardware processors 104 are configured to generate a plurality of updated occupancy maps of the indoor environment using the dynamically updated trained self-attention and adversarial learning driven network. In an embodiment, at step 212, the one or more hardware processors 104 are configured to estimate a layout of the indoor environment based on the plurality of updated occupancy maps. The estimated layout of the indoor environment is a top down bird view layout.
Referring to steps of FIG. 2 of the present disclosure, at step 214, the one or more hardware processors 104 are configured to enable the patch-based discriminator sub-network to quantify correctness of the estimated layout of the indoor environment with respect to a ground truth using one or more metrics. In an embodiment, the one or more metrics may include, but are not limited to, IoU and F1 score. In an embodiment, training a network with just a per pixel loss such as binary cross entropy may not always produce outputs that are structurally coherent, as objects and layouts are typically perceived in groups of pixels or patches. Further, the shape of objects in the estimated layout may be irregular without including any priors about their typical shape. Thus, the patch-based discriminator sub-network is employed to distinguish between the estimated layout and the ground truth layouts. The patch-based discriminator sub-network takes a 3 × 128 × 128 estimated layout as input and outputs a label for various patches, corresponding to real or fake. Due to the high variance in indoor layouts, this adversarial loss is not computed against an arbitrary layout; instead, the corresponding ground truth is used. The GAN loss, via the patch-based discriminator sub-network, essentially guides the decoder toward the correct estimation/prediction by helping to train the weights of the decoder to correct values. The decoder performs the estimation/prediction, while the GAN loss guides it toward the correct estimation/prediction.
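One common way to realize such a patch-based discriminator is a PatchGAN-style stack of strided convolutions that outputs a grid of real/fake logits, one per receptive-field patch, as sketched below; the channel widths and number of layers are assumptions made for the sketch rather than parameters fixed by the present disclosure.

```python
import torch
import torch.nn as nn


class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator: strided convolutions map a 3 x 128 x 128 layout
    to a grid of logits, one real/fake score per receptive-field patch."""

    def __init__(self, in_ch: int = 3, base: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1),             # 128 -> 64
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),           # 64 -> 32
            nn.BatchNorm2d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),       # 32 -> 16
            nn.BatchNorm2d(base * 4),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 4, 1, 4, stride=1, padding=1),              # per-patch logit map
        )

    def forward(self, layout: torch.Tensor) -> torch.Tensor:
        return self.net(layout)    # (B, 1, 15, 15) grid of real/fake scores
```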
In an embodiment, the system 100 performing the above steps of the method depicted in FIG. 2 and 3 may be either an integral part of the robotic device (e.g., drone, robot, UAV, and the like) or externally attached/communicatively connected to the robotic device via one or more I/O interfaces (e.g., input/output interfaces as known in the art).
Experiments Conducted:
IndoLayout Scene Dataset: The method of the present disclosure is evaluated on three different datasets, namely Gibson, Matterport3D, and HM3D, using the Habitat simulator. For the Gibson dataset, results are reported on the Gibson Tiny split and on Gibson 4+, a filtered version of Gibson consisting of scenes rated 4 or above by human evaluators based on the texture and mesh quality of the scene.
Trajectory Generation:
In the present disclosure, training and validation splits for the Gibson, Matterport3D, and HM3D datasets are generated by programming an agent in the Habitat simulator. Specifically, an agent is spawned one meter above the ground, and a continuous trajectory is captured as the agent maps the entire scene. The agent maximises coverage by choosing nearby areas that have yet to be mapped while avoiding movement near walls and obstacles. The objective of mapping ensures that the dataset used in the present disclosure includes a variety of different semantic classes and rooms, as well as diverse layouts that a typical robot may see during navigation. The agent is equipped with an RGB sensor of 512 × 512 resolution and a local map sensor that extracts a 128 × 128 occupancy map at a scale of 0.025 metres per pixel, or 3.2m × 3.2m.
Layout Generation:
In an embodiment, the raw output of the local map sensor is ill-suited for the indoor navigation task as it includes areas that are outside the field of view. For instance, it is not feasible to guess the layout of rooms behind a closed door, which is included in the raw sensor output. Hence, such regions are masked out from the raw occupancy map using a known in the art technique. Further, rays from the agent are projected until they hit a wall, and areas not covered by the rays are excluded (a minimal sketch of this ray-based masking is provided after the dataset statistics below). In the present disclosure, only walls or other tall view-obstructing objects are used for masking purposes, due to which the estimated layouts of the indoor environment still include occlusions induced by furniture and other objects. Rays are shot with a 120° FOV, as opposed to the camera's 90° FOV, which further increases the amount of hallucinated area. Finally, the mask is dilated to further increase coverage. All areas outside the mask are marked as unknown and pixels within the mask are limited to either occupied or free. After generating the trajectories and layouts for all scenes, potentially noisy samples are filtered out by removing images that are directly up against obstacles, based on the maximum visible depth. Also, images are removed where the unknown region is more than 90% of the image. Some statistics for the final datasets used are provided below:
1) Gibson 4+: It consists of 72 training and 12 validation scenes (18,435 training images, 2,955 validation images).
2) Gibson Tiny: It consists of 25 training and 5 validation scenes, (8,176 training images, 1360 validation images).
3) Matterport: It consists of 11 large and varying validation scenes, (7,885 validation images).
4) HM3D: It consists of 100 validation scenes, (32,470 validation images). In the present disclosure, only out of the box validation scores on Matterport and HM3D for all models are reported to evaluate the robustness of the system of the present disclosure.
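As referenced above, the following is a minimal sketch of the ray-based visibility masking over an occupancy grid; the angular sampling, the grid/heading convention, and the omission of the final dilation step are simplifying assumptions made only for illustration.

```python
import numpy as np


def visibility_mask(occupancy, agent_rc, fov_deg=120.0, heading_deg=90.0, n_rays=240):
    """Cast rays from the agent over a grid and keep only cells reached before a wall.

    occupancy : (H, W) bool array, True where a wall / tall obstruction blocks rays
    agent_rc  : (row, col) grid position of the agent
    Returns a (H, W) bool mask; cells outside the mask are later marked 'unknown'.
    """
    h, w = occupancy.shape
    mask = np.zeros((h, w), dtype=bool)
    half = np.deg2rad(fov_deg) / 2.0
    angles = np.deg2rad(heading_deg) + np.linspace(-half, half, n_rays)
    max_len = int(np.hypot(h, w))
    for a in angles:
        dr, dc = -np.cos(a), np.sin(a)            # step direction (grid convention is illustrative)
        r, c = float(agent_rc[0]), float(agent_rc[1])
        for _ in range(max_len):
            ri, ci = int(round(r)), int(round(c))
            if not (0 <= ri < h and 0 <= ci < w):
                break
            mask[ri, ci] = True
            if occupancy[ri, ci]:                 # ray stops at the first wall cell
                break
            r, c = r + dr, c + dc
    return mask
```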
Approaches Evaluated: The effectiveness of the method of the present disclosure is evaluated based on a comparison with following state of the art approaches:
Active Neural SLAM (ANS) RGB: A monocular indoor layout estimation method.
OccAnt RGB: A monocular indoor layout estimation method.
Additionally, scores of OccAnt RGBD, a state of the art approach for indoor layout estimation, are reported as a benchmark. This approach uses both RGB and depth information during training and at inference time. FIG. 4 depicts qualitative results providing a comparison of the method of the present disclosure with state of the art approaches for qualitative evaluation according to some embodiments of the present disclosure. It is observed from FIG. 4 that ANS (RGB) predicts only visible occupancy. Also, it is observed from FIG. 4 that IndoLayout captures the shape of objects more accurately when compared against OccAnt (RGB), and preserves free narrow regions for indoor navigation in tight environments. All input image examples shown in FIG. 4 are from the Gibson 4+ validation set. The final row shown in FIG. 4 corresponds to a challenging instance for all approaches, where even IndoLayout is unable to produce sharp boundaries, likely due to the complexity of the scene.
Evaluation Metrics: The present disclosure quantitatively evaluates the performance of all approaches against the ground truth layouts. The present disclosure reports mean Intersection-over-Union (mIoU) and mean Average Precision (mAP) metrics. Upon analysis, it is found that mIoU effectively captures the minimum of precision and recall, and hence a mean F1 score is additionally reported. Each metric is reported for the occupied and navigable/free class in the layouts. While these metrics capture the efficacy of the method of the present disclosure, they may not capture the visual quality of the layouts well. Optimising for IoU or F1 in particular can have the effect of producing rounded edges, whereas the layouts of objects in bird's eye view are typically sharper. This is particularly true for large objects, where small errors along the boundary will have a minimal contribution to the loss function. Therefore, a Boundary IoU, a more sensitive metric that focuses on boundary quality alone, is reported for each class. A boundary metric for segmentation has also been proposed in a state of the art approach, which measured the difference in tangent angles between contours on the predicted and ground truth segmentations, but such a metric is more suited for building segmentation than the scenario considered in the present disclosure, due to the polygonal nature of buildings. Additionally, a structural similarity (SSIM) score, which considers the similarity in structure between two images, is reported in the present disclosure.
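The per-class metrics could be computed along the lines of the sketch below. Note that the Boundary IoU here is a simplified variant (IoU restricted to thin bands obtained by subtracting an eroded mask), and the band width is an assumed parameter; the exact metric definitions used in the evaluation may differ.

```python
import numpy as np
from scipy import ndimage


def layout_metrics(pred, gt, cls, boundary_width=2):
    """Per-class IoU, F1, and a simplified Boundary IoU between two label maps.

    pred, gt : (H, W) integer label maps (e.g. 0 unknown, 1 occupied, 2 free)
    cls      : class index to evaluate
    """
    p, g = (pred == cls), (gt == cls)
    inter, union = np.logical_and(p, g).sum(), np.logical_or(p, g).sum()
    iou = inter / union if union else 0.0
    f1 = 2 * inter / (p.sum() + g.sum()) if (p.sum() + g.sum()) else 0.0

    # Boundary IoU (simplified): restrict both masks to a thin band around their
    # boundaries (mask minus its erosion), then compute IoU on that band only.
    def boundary(m):
        eroded = ndimage.binary_erosion(m, iterations=boundary_width)
        return np.logical_and(m, np.logical_not(eroded))

    bp, bg = boundary(p), boundary(g)
    b_union = np.logical_or(bp, bg).sum()
    b_iou = np.logical_and(bp, bg).sum() / b_union if b_union else 0.0
    return iou, f1, b_iou
```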
Experimental results:
Layout estimation: The present disclosure reports experimental results for indoor layout estimation in terms of performance on evaluation metrics and quality of generated outputs as provided below:
1) Performance on Evaluation Metrics: To evaluate the performance of the system of the present disclosure on the task of layout estimation, the generated/predicted local occupancy maps are compared against the ground truths and the correctness of these predictions is quantified using the one or more metrics such as the IoU and F1 score metrics. A comparison of the performance of the system of the present disclosure on the Gibson 4+ and Gibson Tiny datasets is provided in Table 1.
| Dataset | Method | Only RGB? | mIoU % Occ | mIoU % Free | mIoU % Mean | mAP % Occ | mAP % Free | mAP % Mean | F1 % Occ | F1 % Free | F1 % Mean | SSIM | Boundary IoU % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gibson 4+ | ANS (RGB) | Yes | 33.04 | 32.67 | 32.85 | 77.81 | 69.72 | 73.76 | 48.23 | 45.84 | 47.03 | 49.23 | 14.65 |
| Gibson 4+ | OccAnt (RGB) | Yes | 52.87 | 61.32 | 57.09 | 70.02 | 72.33 | 71.17 | 68.07 | 74.08 | 71.07 | 66.19 | 36.42 |
| Gibson 4+ | IndoLayout | Yes | 59.06 | 67.84 | 63.45 | 71.92 | 83.01 | 77.46 | 72.96 | 74.02 | 73.49 | 69.37 | 39.06 |
| Gibson 4+ | OccAnt (RGBD) | No | 69.63 | 71.54 | 70.58 | 83.01 | 81.02 | 82.01 | 81.5 | 82.15 | 81.82 | 74.75 | 54.02 |
| Gibson Tiny | ANS (RGB) | Yes | 29.21 | 36 | 32.60 | 70.27 | 72.27 | 71.27 | 43.84 | 49.6 | 46.72 | 47.99 | 14.46 |
| Gibson Tiny | OccAnt (RGB) | Yes | 47.6 | 61.1 | 54.35 | 63.13 | 72.56 | 67.84 | 63.22 | 73.84 | 68.53 | 64.16 | 33.16 |
| Gibson Tiny | IndoLayout | Yes | 52.3 | 64.49 | 58.39 | 64.13 | 77.8 | 70.96 | 67.53 | 76.94 | 72.23 | 66.45 | 34.96 |
| Gibson Tiny | OccAnt (RGBD) | No | 70.8 | 71.96 | 71.38 | 85.45 | 80.5 | 82.975 | 82.36 | 82.24 | 82.3 | 73.2 | 51.25 |
| Matterport | ANS (RGB) | Yes | 24.1 | 34.32 | 29.21 | 66.29 | 77.06 | 71.67 | 37.24 | 48.06 | 42.65 | 42.62 | 12.11 |
| Matterport | OccAnt (RGB) | Yes | 43.02 | 63.3 | 53.16 | 63.53 | 76.45 | 69.99 | 58.08 | 75.65 | 66.86 | 63.79 | 33.44 |
| Matterport | IndoLayout | Yes | 49.48 | 66.39 | 57.93 | 64.61 | 81.62 | 73.11 | 64.55 | 78.12 | 71.33 | 67.34 | 36.44 |
| Matterport | OccAnt (RGBD) | No | 67.53 | 74.68 | 71.10 | 82.04 | 83.25 | 82.64 | 79.69 | 84.12 | 81.90 | 74.92 | 53.33 |
| HM3D | ANS (RGB) | Yes | 31.53 | 35.61 | 33.57 | 80.67 | 70.84 | 75.755 | 46.65 | 48.88 | 47.765 | 49.73 | 13.52 |
| HM3D | OccAnt (RGB) | Yes | 53.17 | 62.85 | 58.01 | 71.9 | 71.68 | 71.79 | 68.26 | 75.15 | 71.705 | 66.61 | 35.79 |
| HM3D | IndoLayout | Yes | 57.02 | 66.23 | 61.625 | 71.64 | 76.19 | 73.915 | 71.57 | 77.86 | 74.715 | 69.22 | 37.6 |
| HM3D | OccAnt (RGBD) | No | 71.55 | 71.68 | 71.615 | 85.91 | 78.84 | 82.375 | 82.84 | 81.9 | 82.37 | 74.98 | 53.13 |
Table 1
All the systems are trained on a training split for both the datasets and then evaluated on a separate validation split. It is observed from Table 1 that among the RGB only networks/systems, the system of the present disclosure (i.e., IndoLayout) is substantially better than the other baselines in its prediction for both the occupied as well as the free space. The system of the present disclosure reduces the gap in performance compared to the RGBD network/system, thus alleviating the penalty incurred for tasks where only a monocular setup can be used. In Table 1, the out of the box performance is also reported on the validation splits for the Matterport and HM3D datasets. Again, it is observed that the system of the present disclosure outperforms the other RGB networks/systems and generalizes better to novel scenes. To better understand the performance of the system of the present disclosure, saliency maps computed using GradCAM for all the networks/systems, as well as the attention map of the system of the present disclosure, are inspected. This gives an insight into which regions the models focus on, for a given image, while generating the corresponding local occupancy maps. FIG. 5 depicts a visualization of saliency maps from various layers of the system of the present disclosure for indoor layout estimation using self-attention and adversarial learning according to some embodiments of the present disclosure. It is observed in FIG. 5 that IndoLayout focuses on relevant objects for the class being predicted. It is shown in FIG. 5 that, due to attention, the system of the present disclosure is able to focus on relevant surfaces while predicting the target classes.
2) Quality of Generated Outputs: IndoLayout produces vastly better layouts qualitatively compared to prior art, as shown in FIG. 4. It is observed that the conventional approaches often produce blurry outputs with more rounded edges, as compared to the method of the present disclosure. This analysis is confirmed by the reported SSIM and Boundary IoU metrics in Table 1, on which significant improvements by the method of the present disclosure over conventional approaches are shown. This gain is attributed to the use of self-attention and a discriminator that operate at the patch level while attending to a global context.
Amodal estimation: The present disclosure demonstrated that IndoLayout is able to consistently perform better than its counterparts. Here, the reason behind this boost is investigated by evaluating performance on the hallucinated area and the visible occupancy separately. It is found that the percentage of pixels hallucinated by the method of the present disclosure is 4% less than that of OccAnt RGB, but the method is 6% more accurate within this area. Since the percentage of hallucinated pixels was calculated using the model's output itself, the higher accuracy shows that the method of the present disclosure performs better within the hallucinated area. This improvement is particularly important for planners that may use the predicted amodal layout. Further, it is found that the system of the present disclosure is 4% more accurate within the visible occupancy region as well, which is critical for navigation purposes.
Ablation Studies: The present disclosure reports ablation study in terms of importance of attention, and effect of adversarial learning as provided below:
1) Importance of Attention: Table 2 provides an examination of the role of each component in IndoLayout by comparing the performance gain over a base model (encoder-decoder sub-network) for both IoU and mAP scores.
| Method | mIoU % Occ | mIoU % Free | mIoU % Mean | mAP % Occ | mAP % Free | mAP % Mean |
|---|---|---|---|---|---|---|
| Base | 52.59 | 64.09 | 58.34 | 70.84 | 73.13 | 71.98 |
| w/ Boundary loss | 54.2 | 63.3 | 58.75 | 69.9 | 74.8 | 72.35 |
| w/ Discriminator | 54.4 | 64.4 | 59.4 | 72.5 | 75.3 | 73.9 |
| w/ Self-attention | 56.79 | 64.24 | 60.51 | 68.97 | 77.85 | 73.41 |
| IndoLayout (All) | 59.06 | 67.84 | 63.45 | 71.92 | 83.01 | 77.46 |
Table 2
It is observed from Table 2 that self-attention has a substantial impact on the overall performance of the system of the present disclosure, followed by the boundary loss and the discriminator. Adding self-attention to the system of the present disclosure significantly improves performance, with the results most pronounced on the Gibson 4+ dataset. This gain is attributed to the expressive power of self-attention, which has been well-established for vision-related tasks. Using self-attention, the system of the present disclosure can learn the global context in addition to local features, which is critical for such a task as distant pixels may help contextualize local patches. Further, the saliency maps visualized in FIG. 5 support the theory that attention is able to focus on relevant regions in the input image more effectively.
2) Effect of Adversarial Learning: By using a patch based discriminator, a 2% improvement in IoU for the occupied class is observed in the present disclosure. In addition, an improvement in visual quality for several images is observed, as shown in FIG. 4. However, the visual improvements are not as pronounced as in outdoor scenarios. This is attributed in the present disclosure to the typical shape of vehicles and road layouts, which belong to a smaller distribution than indoor layouts. Upon inspection, it is found that objects belonging to the same semantic class, such as a dining table, can have vastly varying layouts, in contrast to the outdoor scenario, where vehicle shapes are largely similar. Further, the shape of navigable areas in indoor scenes is highly dependent on furniture placements and room size, making it more challenging to regularize the output using existing layouts.
Application: The present disclosure may be used for multiple applications in the field of robotics such as the point navigation task (i.e., PointNav) and mapping. To establish the significance of the improvements of the method of the present disclosure, IndoLayout is applied to the PointNav task, where an agent in the Habitat simulator has to pathfind and move to a target location without any prior global map. In a conventional system, it is shown that hallucination can significantly help in path planning and navigation for the PointNav task. Hence, in the present disclosure, the layout module used in the RL pipeline of the conventional system is simply replaced with IndoLayout, pretrained on Gibson 4+, and the performance of the system of the present disclosure is evaluated. Similar to the baselines, a comparison of IndoLayout against the RGB variant of the conventional approach is performed. The present disclosure does not attempt to beat a state of the art method on PointNav, namely 'Differentiable SLAM-net', as it uses a different approach entirely, but rather demonstrates how better layout estimators can significantly improve navigation solutions that do use such layouts. Table 3 shows considerable improvement of the method of the present disclosure over all baselines, showing the significance and utility of the method of the present disclosure.
| Difficulty | Method | Success rate | Success weighted by path length (SPL) | Time |
|---|---|---|---|---|
| Easy | ANS RGB | 0.851 | 0.676 | 154.248 |
| Easy | OccAnt RGB | 0.888 | 0.715 | 135.498 |
| Easy | IndoLayout | 0.913 | 0.731 | 127.105 |
| Medium | ANS RGB | 0.626 | 0.488 | 283.329 |
| Medium | OccAnt RGB | 0.698 | 0.532 | 261.636 |
| Medium | IndoLayout | 0.763 | 0.566 | 233.803 |
| Hard | ANS RGB | 0.303 | 0.239 | 429.541 |
| Hard | OccAnt RGB | 0.339 | 0.248 | 417.312 |
| Hard | IndoLayout | 0.431 | 0.337 | 383.33 |
| Overall | ANS RGB | 0.7 | 0.552 | 236.51 |
| Overall | OccAnt RGB | 0.752 | 0.59 | 217.288 |
| Overall | IndoLayout | 0.8 | 0.621 | 198.246 |
Table 3
In the present disclosure, it is found that by simply replacing the layout module, a 5% improvement in success rate and success weighted by path length (SPL), the standard metrics used for this task, is observed. Further, the method of the present disclosure performs significantly better for medium and hard episodes in the validation set, with a 7% higher success rate than OccAnt RGB on medium episodes and a 9% higher success rate on hard episodes. Similar trends are also observed for the other reported metrics, as shown in Table 3. Since these episodes involve navigation across longer distances, the performance improvements suggest that the estimated indoor layouts in the present disclosure are more suited for planning and navigation purposes. However, the system of the present disclosure does not train the policy from scratch, and even greater improvements are expected if the policy is trained with IndoLayout in an end-to-end manner.
Another application of the present disclosure is mapping. Since the dataset used in the present disclosure consists of a continuous trajectory per scene, the layouts predicted for the validation split are used to register maps, using the raw probability values predicted by the system of the present disclosure. In the present disclosure, low-confidence predictions are filtered out using a threshold and information is aggregated using a moving average. FIG. 6 depicts registering a map for an entire scene using the generated plurality of occupancy maps to evaluate the utility of the method of the present disclosure in mapping tasks according to some embodiments of the present disclosure. The results shown in FIG. 6 closely resemble the ground truth, despite the system never being trained on it. While there is room for improvement, this shows the efficacy of IndoLayout for potential mapping tasks despite being an RGB-only model.
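A minimal sketch of such confidence-thresholded fusion with a moving average is given below; the threshold, the averaging factor, and the frame_to_world pose mapping are hypothetical placeholders rather than values or interfaces prescribed by the present disclosure.

```python
import numpy as np


def register_frame(global_map, frame_probs, frame_to_world, conf_thresh=0.75, alpha=0.3):
    """Fuse one predicted local layout into a global map with a running average.

    global_map     : (H, W, 3) accumulated class probabilities for the whole scene
    frame_probs    : (h, w, 3) per-cell probabilities predicted for the current frame
    frame_to_world : callable mapping local (r, c) cells to global (R, C) cells,
                     e.g. derived from the agent pose (assumed to be given)
    """
    conf = frame_probs.max(axis=-1)
    for r in range(frame_probs.shape[0]):
        for c in range(frame_probs.shape[1]):
            if conf[r, c] < conf_thresh:          # drop low-confidence predictions
                continue
            R, C = frame_to_world(r, c)
            if 0 <= R < global_map.shape[0] and 0 <= C < global_map.shape[1]:
                # Exponential moving average of class probabilities per global cell.
                global_map[R, C] = (1 - alpha) * global_map[R, C] + alpha * frame_probs[r, c]
    return global_map
```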
Timing Analysis: Table 4 provides timing analysis of IndoLayout against state-of-the-art methods. In the present disclosure, additional model statistics like model parameter count and inference speed are reported as shown in Table 4.
| Method | Parameters | FPS |
|---|---|---|
| OccAnt (RGB) | 19.86 | 33.1 |
| IndoLayout | 14.35 | 61.02 |
Table 4
It is observed from Table 4 that IndoLayout is twice as fast, with a lower memory footprint, while showing superior performance.
Failure Cases: While the present disclosure shows improvements on several fronts, certain instances are noticed where the network either incorrectly hallucinates regions or fails to produce sharp outputs. One such instance is shown in FIG. 4. Further, in some instances, multiple objects are combined together, suggesting that small gaps between objects can confuse the network.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined herein and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the present disclosure if they have similar elements that do not differ from the literal language of the embodiments or if they include equivalent elements with insubstantial differences from the literal language of the embodiments described herein.
The embodiments of the present disclosure provide a real-time network (60 FPS) that predicts amodal layouts in indoor scenes using just an RGB image. Based on analysis across multiple photorealistic indoor datasets, the efficacy of the system of the present disclosure, which leverages attention, is observed, surpassing prior work by a significant margin in quantitative and qualitative studies.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated herein by the following claims.
CLAIMS:
1. A processor implemented method (200), comprising:
acquiring in real time (202), via one or more processors, an image associated with one or more views of one or more scenes in an indoor environment using an image capturing device mounted on a robotic device;
training (204), via the one or more hardware processors, a self-attention and adversarial learning driven network, wherein the self-attention and adversarial learning driven network comprising:
an encoder-decoder sub-network configured to receive the image associated with the one or more views of the one or more scenes in the indoor environment as an input, wherein the encoder-decoder sub-network comprises an encoder, a self-attention module, and a decoder; and
a patch-based discriminator sub-network in communication with the encoder-decoder sub-network;
generating (206), via the one or more hardware processors, a plurality of occupancy maps of the indoor environment using the trained self-attention and adversarial learning driven network;
dynamically updating (208), via the one or more hardware processors, one or more parameters of the trained self-attention and adversarial learning driven network based on an optimized loss function, wherein the optimized loss function is obtained based on a first loss type, a second loss type and a third loss type;
generating (210), via the one or more hardware processors, a plurality of updated occupancy maps of the indoor environment using the dynamically updated trained self-attention and adversarial learning driven network;
estimating (212), via the one or more hardware processors, a layout of the indoor environment based on the plurality of updated occupancy maps; and
enabling (214), via the one or more hardware processors, the patch-based discriminator sub-network to quantify correctness of estimated layout of the indoor environment with respect to a ground truth using one or more metrics.
2. The processor implemented method as claimed in claim 1, wherein the encoder comprised in the encoder-decoder sub-network of the self-attention and adversarial learning driven network is a convolution-based encoder and configured to extract a plurality of features from the image associated with the one or more views of the one or more scenes in the indoor environment.
3. The processor implemented method as claimed in claim 1, wherein the self-attention module comprised in the encoder-decoder sub-network of the self-attention and adversarial learning driven network uses a single transformer block with multi-head attention and configured to project and aggregate the plurality of features across different patches in a context dependent manner.
4. The processor implemented method as claimed in claim 1, wherein the decoder comprised in the encoder-decoder sub-network of the self-attention and adversarial learning driven network is configured to classify the generated plurality of occupancy maps into one of: (a) an unknown category, (b) an occupied category, and (c) a navigable category.
5. The processor implemented method as claimed in claim 1, wherein the first loss type is weighted cross entropy loss, second loss type is boundary loss, and the third loss type is generative adversarial network (GAN) loss.
6. The processor implemented method as claimed in claim 1, wherein the estimated layout of the indoor environment is a top down bird view layout.
7. A system (100), comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
acquire in real time, an image associated with one or more views of one or more scenes in an indoor environment using an image capturing device mounted on a robotic device;
train, a self-attention and adversarial learning driven network, wherein the self-attention and adversarial learning driven network comprising:
an encoder-decoder sub-network configured to receive the image associated with the one or more views of the one or more scenes in the indoor environment as an input, wherein the encoder-decoder sub-network comprises an encoder, a self-attention module, and a decoder; and
a patch-based discriminator sub-network in communication with the encoder-decoder sub-network;
generate, a plurality of occupancy maps of the indoor environment using the trained self-attention and adversarial learning driven network;
dynamically update, one or more parameters of the trained self-attention and adversarial learning driven network based on an optimized loss function, wherein the optimized loss function is obtained based on a first loss type, a second loss type and a third loss type;
generate, a plurality of updated occupancy maps of the indoor environment using the dynamically updated trained self-attention and adversarial learning driven network;
estimate, a layout of the indoor environment based on the plurality of updated occupancy maps; and
enable, the patch-based discriminator sub-network to quantify correctness of estimated layout of the indoor environment with respect to a ground truth using one or more metrics.
8. The system as claimed in claim 7, wherein the encoder comprised in the encoder-decoder sub-network of the self-attention and adversarial learning driven network is a convolution-based encoder and configured to extract a plurality of features from the image associated with the one or more views of the one or more scenes in the indoor environment.
9. The system as claimed in claim 7, wherein the self-attention module comprised in the encoder-decoder sub-network of the self-attention and adversarial learning driven network uses a single transformer block with multi-head attention and configured to project and aggregate the plurality of features across different patches in a context dependent manner.
10. The system as claimed in claim 7, wherein the decoder comprised in the encoder-decoder sub-network of the self-attention and adversarial learning driven network is configured to classify the generated plurality of occupancy maps into one of: (a) an unknown category, (b) an occupied category, and (c) a navigable category.
11. The system as claimed in claim 7, wherein the first loss type is weighted cross entropy loss, second loss type is boundary loss, and the third loss type is generative adversarial network (GAN) loss.
12. The system as claimed in claim 7, wherein the estimated layout of the indoor environment is a top down bird view layout.