Abstract: In many applications such as robot navigation and autonomous cars, depth perception for capturing the geometric structure of a scene plays an important role. Conventional methods do not ensure proper pixel correspondences in the images of a scene, which leads to inaccurate depth and pose estimations. The present disclosure describes systems and methods for epipolar geometry based learning of multi-view depth estimation from image sequences, which provide better depth images, capture the scene structure more faithfully, and reduce errors in pose estimation. The proposed method utilizes two deep neural networks for estimating depth and relative poses of the images. The deep neural network used for depth estimation incorporates 2-view depth estimation. The estimated depth and relative poses are used in a loss function comprising a photometric loss, minimized by imposing epipolar constraints, and a plurality of pre-computed losses. The loss function is further used for training the deep neural networks.
Claims:
1. A processor implemented method, comprising:
receiving (202), by an image capturing device, a set of successive images specific to a scene comprising a plurality of source images and a plurality of target images;
estimating (204), using a first deep neural network, depth information for a target image from the plurality of target images of the scene with reference to a source image from the plurality of source images, wherein the first deep neural network receives two consecutive images as input, and wherein a first input image represents the target image and a second input image represents the source image;
estimating (206), using a second deep neural network, a plurality of relative poses between (i) a set of source images from the plurality of source images and (ii) a target image;
warping (208), using the estimated depth and the plurality of relative poses, the set of source images into a frame of the plurality of target images to obtain a plurality of warped images;
computing (210) a first photometric loss between each pixel of the plurality of warped images and the plurality of target images, wherein the first photometric loss is computed by finding point to point correspondences between each pixel of the plurality of warped images and the plurality of target images, and wherein each warped image from the plurality of warped images represents a new image;
minimizing (212) the first photometric loss to obtain a second photometric loss by imposing one or more epipolar constraints on each warped pixel using an essential matrix that is obtained between consecutive images from the set of successive images using a five point algorithm,
wherein the second photometric loss is computed by weighting the first photometric loss between each warped pixel and each pixel of the target images with a weighting parameter, and
wherein the weighting parameter is a function of epipolar loss computed between each pixel of a corresponding target image and each pixel of a corresponding warped image; and
estimating (214) a loss function, based on the computed second photometric loss and a plurality of pre-computed losses, across all the pixels of the warped image representing the new image.
2. The method as claimed in claim 1, wherein the warped pixels lie on a corresponding epipolar line when the epipolar constraints are imposed, and wherein the corresponding epipolar line of warped pixel is characterized as a projection of a three-dimensional ray corresponding to the warped pixel in the target image.
3. The method as claimed in claim 1, wherein the plurality of pre-computed losses comprise at least one of (i) a depth consistency loss, (ii) smoothness loss, and (iii) a structural similarity index loss.
4. The method of claim 3, wherein the depth consistency loss from the plurality of pre-computed losses is computed to determine consistency between depth of a target image computed with reference to a first source image from the plurality of source images and at least a second source image from the plurality of source images.
5. The method as claimed in claim 1, further comprising training the first deep neural network and the second deep neural network, based on the loss function and by using an Adam optimizer, until a number of iterations specific to the training of the first deep neural network and the second deep neural network reaches a pre-determined value.
6. A system (100), comprising:
a memory (102);
one or more communication interfaces (104); and
one or more hardware processors (106) coupled to said memory through said one or more communication interfaces, wherein said one or more hardware processors are configured to:
receive, by an image capturing device, a set of successive images specific to a scene comprising a plurality of source images and a plurality of target images;
estimate, using a first deep neural network, depth information for a target image from the plurality of target images of the scene with reference to a source image from the plurality of source images, wherein the first deep neural network receives two consecutive images as input, and wherein a first input image represents the target image and a second input image represents the source image;
estimate, using a second deep neural network, a plurality of relative poses between (i) a set of source images from the plurality of source images and (ii) a target image;
warp, using the estimated depth and the plurality of relative poses, the set of source images into a frame of the plurality of target images to obtain a plurality of warped images;
compute a first photometric loss between each pixel of the plurality of warped images and the plurality of target images, wherein the first photometric loss is computed by finding point to point correspondences between each pixel of the plurality of warped images and the plurality of target images, and wherein each warped image from the plurality of warped images represents a new image;
minimize the first photometric loss to obtain a second photometric loss by imposing one or more epipolar constraints on each warped pixel using an essential matrix that is obtained between consecutive images from the set of successive images using a five point algorithm,
wherein the second photometric loss is computed by weighting the first photometric loss between each pixel of the warped images and the target images with a weighting parameter,
and wherein the weighting parameter is a function of epipolar loss computed between each pixel of a corresponding target image and each pixel of a corresponding warped image; and
estimate a loss function, based on the computed second photometric loss and a plurality of pre-computed losses, across all the pixels of the warped image representing the new image.
7. The system as claimed in claim 6, wherein the warped pixels lie on a corresponding epipolar line when the epipolar constraints are imposed, and wherein the corresponding epipolar line of warped pixel is characterized as a projection of a three-dimensional ray corresponding to the warped pixel in the target image.
8. The system as claimed in claim 6, wherein the plurality of pre-computed losses comprise at least one of (i) a depth consistency loss, (ii) smoothness loss, and (iii) a structural similarity index loss.
9. The system as claimed in claim 8, wherein the depth consistency loss from the plurality of pre-computed losses is computed to determine consistency between depth of a target image computed with reference to a first source image from the plurality of source images and at least a second source image from the plurality of source images.
10. The system as claimed in claim 6, wherein the one or more hardware processors are further configured to train the first deep neural network and the second deep neural network, based on the loss function and by using an Adam optimizer, until a number of iterations specific to the training of the first deep neural network and the second deep neural network reaches a pre-determined value.
Description:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
“EPIPOLAR GEOMETRY BASED LEARNING OF MULTI-VIEW DEPTH ESTIMATION FROM IMAGE SEQUENCES”
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The disclosure herein generally relates to the field of learning depth estimation from image sequences, and more particularly to epipolar geometry based learning of multi-view depth estimation from image sequences.
BACKGROUND
A large number of applications, such as mobile robots, self-driving cars, robot navigation and unmanned aerial vehicles, rely on systems that infer the geometric structure of a scene, where depth perception plays an important role. Traditional approaches for inferring scene geometry and depth, though efficient, require accurate point correspondences for computing the camera poses and recovering the structure. A spin-off of this problem comes under the domain of Visual Simultaneous Localization and Mapping (SLAM) or Visual Odometry (VO), which involves real-time estimation of camera poses and/or a structural 3D map of an environment. Deep learning based methods have gained momentum in solving the problem of dense image depth and Visual Odometry estimation. However, traditional deep learning based methods attempt to infer the depth of images of a scene from a single view, which may not effectively capture the relation between pixels of images of a scene.
Traditional VO approaches are either sparse, semi-dense or dense. Sparse and semi-dense approaches suffer from improper correspondences in texture-less areas, occlusions and repeating patterns. In an image setting, sparse correspondences allow estimation of depth for the corresponding points; however, estimating dense depth from a single image is a much more complex problem. Further, traditional approaches simply minimize the photometric loss, which does not ensure proper pixel correspondences and may therefore lead to inaccurate depth and pose estimations.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one aspect, there is provided a processor implemented method, comprising: receiving, by an image capturing device, a set of successive images specific to a scene comprising a plurality of source images and a plurality of target images; estimating, using a first deep neural network, depth information for a target image from the plurality of target images of the scene with reference to a source image from plurality of source images, wherein the first deep neural network receives two consecutive images as input, and wherein a first input image represents the target image and a second input image represents the source image; estimating, using a second deep neural network, a plurality of relative poses between (i) a set of source images from the plurality of source images and (ii) a target image; warping, using the estimated depth and plurality of relative poses, the set of source images into frame of the plurality of target images to obtain a plurality of warped images; computing a first photometric loss between each pixel of the plurality of warped images and the plurality of target images, wherein the first photometric loss is computed by finding point to point correspondences between each pixel of the plurality of warped images and the plurality of target images, and wherein each warped image from the plurality of warped images represents a new image; minimizing the first photometric loss to obtain a second photometric loss by imposing one or more epipolar constraints on each warped pixel using an essential matrix that is obtained between consecutive images from the set of successive images using a five point algorithm, wherein the second photometric loss is computed by weighing the first photometric loss between each warped pixel and each pixel of target images with a weighting parameter, and wherein the weighting parameter is a function of epipolar loss computed between each pixel of a corresponding target image and each pixel of a corresponding warped image.
In an embodiment, the warped pixels lie on a corresponding epipolar line when the epipolar constraints are imposed, and wherein the corresponding epipolar line of warped pixel is characterized as a projection of a three-dimensional ray corresponding to the warped pixel in the target image. In an embodiment, the method further comprising estimating a loss function, based on the computed second photometric loss and a plurality of pre-computed losses, across all the pixels of the warped image representing the new image. In an embodiment, the plurality of pre-computed losses comprise at least one of (i) a depth consistency loss, (ii) smoothness loss, and (iii) a structural similarity index loss. In an embodiment, the depth consistency loss from the plurality of pre-computed losses is computed to determine consistency between depth of a target image computed with reference to a first source image from plurality of source images and at least a second source image from plurality of source images. In an embodiment, the method further comprising training the first deep neural network and the second deep neural network, based on the loss function and by using an Adam optimizer, until number of iterations specific to the training of the first deep neural network and the second deep neural network reaches a pre-determined value.
In another aspect, there is provided a system comprising: a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory through the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to receive, by an image capturing device, a set of successive images specific to a scene comprising a plurality of source images and a plurality of target images; estimate, using a first deep neural network, depth information for a target image from the plurality of target images of the scene with reference to a source image from plurality of source images, wherein the first deep neural network receives two consecutive images as input, and wherein a first input image represents the target image and a second input image represents the source image; estimate, using a second deep neural network, a plurality of relative poses between (i) a set of source images from the plurality of source images and (ii) a target image; warp, using the estimated depth and plurality of relative poses, the set of source images into frame of the plurality of target images to obtain a plurality of warped images; compute a first photometric loss between each pixel of the plurality of warped images and the plurality of target images, wherein the first photometric loss is computed by finding point to point correspondences between each pixel of the plurality of warped images and the plurality of target images, and wherein each warped image from the plurality of warped images represents a new image; minimize the first photometric loss to obtain a second photometric loss by imposing one or more epipolar constraints on each warped pixel using an essential matrix that is obtained between consecutive images from the set of successive images using a five point algorithm, wherein the second photometric loss is computed by weighing the first photometric loss between each pixel of the warped images and target images with a weighting parameter, and wherein the weighting parameter is a function of epipolar loss computed between each pixel of a corresponding target image and each pixel of a corresponding warped image.
In an embodiment, the warped pixels lie on a corresponding epipolar line when the epipolar constraints are imposed, and wherein the corresponding epipolar line of warped pixel is characterized as a projection of a three-dimensional ray corresponding to the warped pixel in the target image. In an embodiment, the one or more hardware processors are further configured to estimate a loss function, based on the computed second photometric loss and a plurality of pre-computed losses, across all the pixels of the warped image representing the new image. In an embodiment, the plurality of pre-computed losses comprise at least one of (i) a depth consistency loss, (ii) smoothness loss, and (iii) a structural similarity index loss. In an embodiment, the depth consistency loss from the plurality of pre-computed losses is computed to determine consistency between depth of a target image computed with reference to a first source image from plurality of source images and at least a second source image from plurality of source images. In an embodiment, the one or more hardware processors are further configured to train the first deep neural network and the second deep neural network, based on the loss function and by using an Adam optimizer, until number of iterations specific to the training of the first deep neural network and the second deep neural network reaches a pre-determined value.
In yet another aspect, there are provided one or more non-transitory machine readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause receiving, by an image capturing device, a set of successive images specific to a scene comprising a plurality of source images and a plurality of target images; estimating, using a first deep neural network, depth information for a target image from the plurality of target images of the scene with reference to a source image from plurality of source images, wherein the first deep neural network receives two consecutive images as input, and wherein a first input image represents the target image and a second input image represents the source image; estimating, using a second deep neural network, a plurality of relative poses between (i) a set of source images from the plurality of source images and (ii) a target image; warping, using the estimated depth and plurality of relative poses, the set of source images into frame of the plurality of target images to obtain a plurality of warped images; computing a first photometric loss between each pixel of the plurality of warped images and the plurality of target images, wherein the first photometric loss is computed by finding point to point correspondences between each pixel of the plurality of warped images and the plurality of target images, and wherein each warped image from the plurality of warped images represents a new image; minimizing the first photometric loss to obtain a second photometric loss by imposing one or more epipolar constraints on each warped pixel using an essential matrix that is obtained between consecutive images from the set of successive images using a five point algorithm, wherein the second photometric loss is computed by weighing the first photometric loss between each warped pixel and each pixel of target images with a weighting parameter, and wherein the weighting parameter is a function of epipolar loss computed between each pixel of a corresponding target image and each pixel of a corresponding warped image.
In an embodiment, the warped pixels lie on a corresponding epipolar line when the epipolar constraints are imposed, and wherein the corresponding epipolar line of warped pixel is characterized as a projection of a three-dimensional ray corresponding to the warped pixel in the target image. In an embodiment, the instructions may further cause estimating a loss function, based on the computed second photometric loss and a plurality of pre-computed losses, across all the pixels of the warped image representing the new image. In an embodiment, the plurality of pre-computed losses comprise at least one of (i) a depth consistency loss, (ii) smoothness loss, and (iii) a structural similarity index loss. In an embodiment, the depth consistency loss from the plurality of pre-computed losses is computed to determine consistency between depth of a target image computed with reference to a first source image from plurality of source images and at least a second source image from plurality of source images. In an embodiment, the instructions may further cause training the first deep neural network and the second deep neural network, based on the loss function and by using an Adam optimizer, until number of iterations specific to the training of the first deep neural network and the second deep neural network reaches a pre-determined value.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates a functional block diagram of a system for epipolar geometry based learning of multi-view depth estimation from image sequences, according to some embodiments of the present disclosure;
FIGS. 2A and 2B depict an exemplary flow diagram of a processor implemented method for epipolar geometry based learning of multi-view depth estimation from image sequences, in accordance with some embodiments of the present disclosure; and
FIG. 3 depicts a high level architectural diagram for epipolar geometry based learning of multi-view depth estimation from image sequences, in accordance with some embodiments of the present disclosure.
FIGS. 4A and 4B depict design of a first deep neural network and a second deep neural network, in accordance with some embodiments of the present disclosure.
FIGS. 5A through 5C depict the performance of the proposed system and traditional systems analyzed in terms of depth estimation results, in accordance with some embodiments of the present disclosure.
FIGS. 6A and 6B depict 3-view depth estimation results without and with larger convolutional filter size respectively, in accordance with some embodiments of the present disclosure.
It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems and devices embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
The embodiments herein provide systems and methods for epipolar geometry based learning of multi-view depth estimation from image sequences by constraining correspondences to lie on their corresponding epipolar lines. This is performed by weighting a plurality of losses using epipolar constraints with an Essential Matrix obtained from a Five Point Algorithm. The Five Point Algorithm provides guidance for training and improves estimation of the depth of a scene from image sequences. The proposed method utilizes two deep neural networks, wherein one deep neural network (also referred to as the depth network) is used for estimating the depth of images of a scene and the other deep neural network (also referred to as the pose network) is used for determining relative poses between images. The depth network takes a pair of consecutive images as input, rather than a single image, and calculates the depth of the scene as seen in the first image. The pose network takes an image sequence as input. The first image is the target image with respect to which the poses of the other images are calculated. Both networks are independent of each other and are trained jointly to effectively capture the coupling between scene depth and camera motion in a learning based paradigm. The proposed method incorporates two-view depth estimation, which improves the depth estimation. Further, the proposed method incorporates epipolar constraints to make the learning more geometrically oriented by using the per-pixel epipolar distance as a weighting factor that helps deal with occlusions and non-rigid objects. The epipolar constraints enforce pixel level correspondences and remove the need to predict motion masks to discount regions undergoing motion, occluded areas, or other factors that could affect the learning, thereby leaving fewer parameters to estimate. The proposed method helps in cases of violation of the static scene assumption and tackles the problem of improper correspondence generation that arises when only a first photometric loss is minimized.
Referring now to the drawings, and more particularly to FIGS. 1 through 6B, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates a functional block diagram of a system for epipolar geometry based learning of multi-view depth estimation from image sequences, according to some embodiments of the present disclosure. The system 100 includes or is otherwise in communication with one or more hardware processors such as a processor 106, an I/O interface 104, at least one memory such as a memory 102, comprising an image processing module 108. In an embodiment, the image processing module 108 can be implemented as a standalone unit in the system 100. In another embodiment, the image processing module 108 can be implemented as a module in the memory 102. The processor 106, the I/O interface 104, and the memory 102, may be coupled by a system bus.
The I/O interface 104 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The interfaces 104 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a camera device, and a printer. The interfaces 104 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For the purpose, the interfaces 104 may include one or more ports for connecting a number of computing systems with one another or to another server computer. The I/O interface 104 may include one or more ports for connecting a number of devices to one another or to another server.
The hardware processor 106 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the hardware processor 106 is configured to fetch and execute computer-readable instructions stored in the memory 102.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 102 includes the image processing module 108 and a repository 110 for storing data processed, received, and generated by the image processing module 108. All the information and computation performed or processed by the image processing module 108 are stored in the memory of the system 100 and are invoked as applicable and required. The image processing module 108 may include routines, programs, objects, components, data structures, and so on, which perform particular tasks or implement particular abstract data types.
The data repository 110, amongst other things, includes a system database and other data. The other data may include data generated as a result of the execution of the image processing module 108. The system database is dynamically updated by further learning the data already stored in it.
In an embodiment, the image processing module 108 can be configured to perform epipolar geometry based learning of multi-view depth estimation from image sequences. Epipolar geometry based learning of multi-view depth estimation from image sequences can be carried out by using methodology, described in conjunction with FIGS. 2A through 4B and using examples.
FIGS. 2A and 2B, with reference to FIG. 1, depict an exemplary flow diagram of a processor implemented method for epipolar geometry based learning of multi-view depth estimation from image sequences using the image processing module 108 of FIG. 1, in accordance with some embodiments of the present disclosure. In an embodiment of the present disclosure, at step 202 of FIG. 2A, the one or more hardware processors 106 receive, via an image capturing device, a set of successive images specific to a scene comprising a plurality of source images and a plurality of target images. In an embodiment, the image capturing device may include, but is not limited to, a camera, a drone, a camcorder, a video recorder, a mobile phone, a satellite, and the like. In an embodiment, the set of successive images is interchangeably referred to as an image sequence or a sequence of images. In an embodiment, the image sequences may include, but are not limited to, a monocular image sequence, a stereo image sequence, and the like.
Further, at step 204 of FIG. 2A, the one or more hardware processors 106 are configured to estimate depth information for a target image from the plurality of target images of the scene with reference to a source image from the plurality of source images. In an embodiment, the depth information is estimated by using a first deep neural network (alternatively referred to as the depth network). In an embodiment, the first deep neural network may include, but is not limited to, a convolutional neural network (CNN), a recurrent neural network (RNN), and the like. FIG. 3 depicts a high level architectural diagram for epipolar geometry based learning of multi-view depth estimation from image sequences, in accordance with some embodiments of the present disclosure. As depicted in FIG. 3, the first deep neural network receives two consecutive images as input. Here, the first input image of the two consecutive input images represents the target image and the second input image represents the source image. In other words, the first deep neural network takes two images as input (e.g., a pair of RGB images concatenated along the colour channel (H×W×6)) and outputs the depth of the first image. In an embodiment, the reason for considering two consecutive images as input to the first deep neural network, instead of a single image, is to leverage the relationship between pixels over multiple images to calculate the depth, rather than relying on a plurality of learnt semantics of the scene in a single image. In another embodiment, by learning a plurality of semantic artefacts in the scene from multiple views, the network can learn inter-pixel relationships and correspondences, in a similar way as optical flow networks are modelled. In an iteration, depths are estimated with each of the S source images as the second input.
FIG. 4A depicts the design of the first deep neural network, in accordance with some embodiments of the present disclosure. As depicted in FIG. 4A, the first deep neural network comprises a convolutional-deconvolutional encoder-decoder network with skip connections from previous layers. Here, depth estimation is performed at 4 different scale levels. The output at each scale level is upsampled and concatenated to a deconv layer for the next scale. The first 4 layers have kernel sizes 7, 7, 5 and 5 respectively, and the remaining layers have a kernel of size 3. The number of output channels for the first layer is 32 and increases by a factor of 2 after each layer until it reaches 512, following which it stays the same. The decoder uses a sequence of layers in which first a deconvolution is performed, followed by a convolution of the concatenation of the current layer with the corresponding layer in the encoder. This is performed for the first 4 deconv-conv sequences, after which the output depth estimation gets upsampled and concatenated. The first two deconv-conv sequences have 512 output channels, which get reduced by a factor of 2 for each subsequent sequence. The output layers are single channel convolutional layers with a kernel size of 3 and a stride of 1. The strides alternate between 2 and 1 for the non-output layers in the whole network. For all the layers except the output layers, Rectified Linear Unit (ReLU) activations are used. For the output layers, a sigmoid function of the form S·σ(x) + β is used, where σ(·) denotes the sigmoid. Here, S is a scaling factor whose value is kept as 10 to keep the output in a reasonable range, and β is an offset maintained at a value of 0.01 to ensure positive non-zero outputs. Further, the estimated depth is scaled to have unit mean by applying depth normalization. In an embodiment, the normalization is performed to remove any scale ambiguity in the estimated depths.
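By way of illustration only, and not as a limitation, a minimal sketch of the output-layer activation and depth normalization described above is provided below; the standard logistic sigmoid, the function names and the array shape are assumptions of this example, while S and β correspond to the scaling factor (10) and offset (0.01) mentioned in the preceding paragraph.

```python
import numpy as np

S = 10.0     # scaling factor, kept as 10 to keep the output in a reasonable range
BETA = 0.01  # offset, kept as 0.01 to ensure positive non-zero outputs

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def depth_output(logits):
    """Map raw single-channel outputs of the output layer to positive depths."""
    depth = S * sigmoid(logits) + BETA
    # Depth normalization: scale the predicted map to have unit mean,
    # removing the global scale ambiguity of the estimated depths.
    return depth / np.mean(depth)

# Example with an arbitrary 128x416 map of raw network outputs.
raw = np.random.randn(128, 416)
d = depth_output(raw)
print(d.mean())  # approximately 1.0 after normalization
```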
Referring back to FIG. 2A, at step 206, the one or more hardware processors 106 are configured to estimate, using a second deep neural network, a plurality of relative poses between (i) a set of source images from the plurality of source images and (ii) a target image. In an embodiment, the second deep neural network is interchangeably referred as a pose network. The second deep neural network (interchangeably referred as pose network) takes an image sequence as input. First image comprised in the image sequence is the target image with respect to which the relative poses of other images from the image sequence are calculated. The other images from the image sequence are referred as the set of source images.
FIG. 4B depicts the design of the second deep neural network, in accordance with some embodiments of the present disclosure. As depicted in FIG. 4B, the second deep neural network comprises 7 convolutional layers with ReLU activation followed by a single stride output layer with no activation. All layers have a kernel size of 3 except the first 2 layers, which have kernel sizes of 7 and 5 respectively. The number of output channels of the first layer is 16 and increases by a factor of 2. Finally, global average pooling is applied to the network output for aggregating the pose estimates at all spatial locations. In an embodiment, for the second deep neural network the target image and the source images are concatenated along the colour channel, giving rise to an input layer of size H×W×3N, where N is the number of input images. The second deep neural network predicts 6 Degrees of Freedom (DoF) poses for each of the N - 1 source images relative to the target image. In an embodiment, both deep neural networks depicted in FIG. 3 (in this case, the first deep neural network and the second deep neural network) are independent of each other and are trained jointly to effectively capture the coupling between scene depth and camera motion in a learning based paradigm.
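Purely as an illustrative sketch (not the exact network code), the following Keras-style definition reflects the interface described above: N images concatenated along the colour channel (H×W×3N) and a 6-DoF pose per source image obtained through global average pooling. The strides, padding, output-layer kernel size and the exact channel progression beyond what is stated in the text are assumptions.

```python
import tensorflow as tf

def build_pose_net(h=128, w=416, num_views=3):
    """Pose network sketch: input H x W x 3N, output (N-1) six-DoF poses."""
    n_src = num_views - 1
    inp = tf.keras.Input(shape=(h, w, 3 * num_views))
    x = inp
    channels = 16
    for kernel in [7, 5, 3, 3, 3, 3, 3]:           # 7 conv layers with ReLU activation
        x = tf.keras.layers.Conv2D(channels, kernel, strides=2, padding="same",
                                   activation="relu")(x)
        channels *= 2                               # channels increase by a factor of 2
    # Single-stride output layer with no activation: 6 values per source image.
    x = tf.keras.layers.Conv2D(6 * n_src, 3, strides=1, padding="same")(x)
    # Global average pooling aggregates the estimates at all spatial locations.
    poses = tf.keras.layers.GlobalAveragePooling2D()(x)
    return tf.keras.Model(inp, tf.keras.layers.Reshape((n_src, 6))(poses))

pose_net = build_pose_net()  # predicts relative poses of the N-1 source images
```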
Referring back to FIG. 2A, at step 208, the one or more hardware processors 106 are configured to warp, using the estimated depth and the plurality of relative poses, the set of source images into the frame of the plurality of target images to obtain a plurality of warped images. In an embodiment, for warping a source image into the frame of a corresponding target image, normalized coordinates of each pixel of the target image are considered. Further, a pixel of the target image (denoted as p) in normalized coordinates, along with the estimated depth information, is transformed into the source frame using the relative pose. The transformed point is further projected onto the source image's plane to obtain a warped pixel using equation (1) provided below:
\hat{p} = K \left( R_{t \to s}\, D(p)\, K^{-1} p + t_{t \to s} \right) \qquad (1)
In the above equation (1), $\hat{p}$ denotes the warped pixel, $K$ denotes the intrinsic calibration matrix of the camera, and $D(p)$ denotes the estimated depth of a target pixel with a corresponding source pixel. $R_{t \to s}$ and $t_{t \to s}$ denote the rotation and translation, respectively, from the target frame to the source frame. In other words, a warped pixel is taken from the source image by applying transformations using the relative poses. The transformations provide the pixel coordinates in the source image which correspond to pixels in the target image. In an embodiment, the homogeneous coordinates of the warped pixels are continuous. However, integer values of the warped pixel coordinates are required. Thus, for converting the continuous homogeneous coordinates of the warped pixels into integer form, the values of the warped pixel are interpolated from the corresponding nearby pixels using an existing bi-linear sampling method. In an embodiment, the above warping procedure is applied on all the pixels of the target image and the corresponding source image to obtain a resultant warped image.
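As a non-limiting illustration, a minimal NumPy sketch of the per-pixel warp of equation (1) and of bi-linear sampling is given below. The pinhole intrinsics K, the relative pose (R, t) from the target frame to the source frame, and the single-pixel interface are assumptions made for brevity; a practical implementation vectorizes these operations over the whole image.

```python
import numpy as np

def warp_pixel(p, depth, K, R, t):
    """p: homogeneous pixel [u, v, 1] in the target image; depth: D(p).
    Returns the continuous warped pixel coordinates in the source image."""
    ray = np.linalg.inv(K) @ p          # normalized coordinates (3D ray K^-1 p)
    cam_point = depth * ray             # back-projection using the estimated depth
    p_hat = K @ (R @ cam_point + t)     # equation (1)
    return p_hat[:2] / p_hat[2]         # perspective division to pixel coordinates

def bilinear_sample(image, u, v):
    """Interpolate intensity at continuous coordinates (u, v) from 4 neighbours."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    patch = image[v0:v0 + 2, u0:u0 + 2].astype(float)
    weights = np.array([[(1 - du) * (1 - dv), du * (1 - dv)],
                        [(1 - du) * dv,       du * dv]])
    return float(np.sum(patch * weights))
```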
Referring back to FIG. 2A, at step 210, the one or more processors are configured to compute a first photometric loss between each pixel of the plurality of warped images and the plurality of target images. In an embodiment, a photometric loss is defined as sum of pixel intensity difference between two images. In an embodiment, the first photometric loss is computed by finding point to point correspondences between each pixel of the plurality of warped images and the plurality of target images. In an embodiment, each warped image from the plurality of warped images represents a new image. In an embodiment, the first photometric loss, for a target image and a plurality of warped images, is computed based on equation 2 provided below as:
L_{1warp} = \frac{1}{N} \sum_{s=0}^{S} \sum_{p=0}^{N} \left| I_t(p) - \hat{I}_s(p) \right| \qquad (2)
In the above equation (2), $I_t$ denotes the target image, $\hat{I}_s$ denotes the warped image, $L_{1warp}$ denotes the first photometric loss, $N$ denotes the total number of pixels of the target image, and $p$ denotes a pixel. In an embodiment, the first photometric loss does not take any ambiguous pixels into consideration, such as pixels belonging to the image of non-rigid objects, pixels which are occluded, and the like. Thus, the first photometric error needs to be minimized by weighting the pixels of the images based on whether they are properly projected or not. In an embodiment, one way of ensuring correct projection is to check whether a pixel satisfies one or more epipolar constraints.
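For illustration, a minimal sketch of the first photometric loss of equation (2) is given below, assuming grayscale target and warped images of identical shape; the function name is a placeholder of this example.

```python
import numpy as np

def first_photometric_loss(target, warped_sources):
    """Equation (2): mean absolute intensity difference between the target
    image and each warped source image, summed over the S source views."""
    n_pixels = target.size
    loss = 0.0
    for warped in warped_sources:
        loss += np.sum(np.abs(target - warped)) / n_pixels
    return loss
```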
Further, as depicted at step 212 of FIG. 2B, the one or more processors are configured to minimize the first photometric loss to obtain a second photometric loss by imposing one or more epipolar constraints on each warped pixel using an essential matrix. In an embodiment, the warped pixels lie on a corresponding epipolar line when the epipolar constraints are imposed. In an embodiment, the corresponding epipolar line of the warped pixel is characterized as a projection of a three-dimensional ray corresponding to the warped pixel in the target image. It is known in the existing literature that a pixel $p$ in an image corresponds to a ray in 3D, which is given by its normalized coordinates $K^{-1} p$, where $K$ is the intrinsic calibration matrix of the camera. It is also known that, in a second view, the image of the first camera center is called an epipole and the image of such a ray is called an epipolar line. In such cases, the one or more epipolar constraints imply that the corresponding pixel of the image in the second view lies on the corresponding epipolar line. In another embodiment, the one or more epipolar constraints on each pixel of the plurality of warped images ensure that the warped pixels lie on their corresponding epipolar lines. In an embodiment, the second photometric loss is computed by weighting the first photometric loss between each warped pixel and each pixel of the target images with a weighting parameter. In an embodiment, the weighting parameter is a function of an epipolar loss computed between each pixel of a corresponding target image and each pixel of a corresponding warped image. In an embodiment, the epipolar loss is expressed mathematically by equation (3) provided below as:
\hat{\tilde{p}}^{T} E \tilde{p} = 0 \qquad (3)
Here, $\tilde{p}$ denotes the normalized coordinates of a pixel $p$, $\hat{\tilde{p}}$ denotes the normalized coordinates of the pixel $p$ in the second view (also referred to as the warped pixel), $E$ denotes the essential matrix, and $E\tilde{p}$ depicts the epipolar line in the second view corresponding to the pixel $p$ in the first view. In another embodiment, the epipolar loss is referred to as the error incurred in capturing the pixel $p$, in finding the corresponding warped pixel $\hat{p}$, or in estimating the Essential Matrix $E$. In an embodiment, the essential matrix is estimated between consecutive images from the set of successive images using a five point algorithm. In an embodiment, the five point algorithm provides a solution to the problem of identifying possible configurations of points and cameras when the projections of five unknown points onto two unknown views are provided. The five point algorithm solves a tenth degree polynomial in order to extract the Essential Matrix $E$, which can then be decomposed into the rotation $R$ and translation $t$ between the two views. In an embodiment, the second photometric loss is expressed mathematically by equation (4) provided below as:
L_{2warp} = \frac{1}{N} \sum_{s=0}^{S} \sum_{p=0}^{N} \left| I_t(p) - \hat{I}_s(p) \right| \, e^{\left| \hat{\tilde{p}}^{T} E \tilde{p} \right|} \qquad (4)
In an embodiment, the second photometric loss is calculated because, for a non-rigid object, the first photometric error would be high even if the pixel of the image of the object is properly projected. In order to ensure that such pixels are given a low weight, the pixels of the image of the object are weighted with their epipolar distance, which would be low if a pixel is properly projected. If the epipolar loss is high, it means that the projection is wrong, giving a high weight to the first photometric loss and thereby increasing its overall penalty. Further, the second photometric loss helps in mitigating the problem of a pixel getting projected to a region of similar intensity by constraining it to lie along the epipolar line.
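As an illustrative sketch only, the following example computes the per-pixel epipolar weighting of equation (4), assuming the essential matrix E is obtained with a five-point solver (for example, OpenCV's findEssentialMat, which applies a five-point algorithm inside a RANSAC loop). The intrinsics, the matched points and the function names are placeholders of this example and not part of the claimed method.

```python
import cv2
import numpy as np

# Placeholder intrinsics and matched points between two consecutive frames;
# in the described system the correspondences come from the warping step.
K = np.array([[718.0, 0.0, 607.0],
              [0.0, 718.0, 185.0],
              [0.0, 0.0, 1.0]])
pts_t = np.random.rand(100, 2) * np.array([416, 128])   # target-frame pixels
pts_s = pts_t + np.random.randn(100, 2)                 # source-frame pixels

# Essential matrix from a five-point solver with RANSAC (assumed single 3x3 solution).
E, inlier_mask = cv2.findEssentialMat(pts_t, pts_s, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)

def epipolar_weight(p_t, p_s, E, K):
    """exp(|p_s_norm^T E p_t_norm|), the weighting term of equation (4)."""
    p_t_n = np.linalg.inv(K) @ np.append(p_t, 1.0)   # normalized target pixel
    p_s_n = np.linalg.inv(K) @ np.append(p_s, 1.0)   # normalized warped pixel
    return np.exp(abs(p_s_n @ E @ p_t_n))

def second_photometric_term(intensity_diff, p_t, p_s, E, K):
    """Per-pixel term of equation (4): photometric residual times epipolar weight."""
    return intensity_diff * epipolar_weight(p_t, p_s, E, K)
```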
Referring back to FIG. 2B, at step 214, the one or more hardware processors are configured to estimate a loss function, based on the computed second photometric loss and a plurality of pre-computed losses, across all the pixels of the warped image representing the new image. In an embodiment, the plurality of pre-computed losses comprise at least one of (i) a depth consistency loss, (ii) a smoothness loss, and (iii) a structural similarity index loss. In an embodiment, the depth consistency loss from the plurality of pre-computed losses is computed to determine consistency between the depth of a target image computed with reference to a first source image from the plurality of source images and at least a second source image from the plurality of source images. In an embodiment, the computation of the depth consistency loss is explained with the help of an example below. For example, it is assumed that a set of successive images received from the image capturing device comprises three consecutive images (say Image 1, Image 2 and Image 3). Further, the middle image (e.g., Image 2) amongst the three consecutive images is considered to be the target image. Since the first deep neural network (alternatively referred to as the depth network) takes only two consecutive images as input, the remaining two images (Image 1 and Image 3) from the three consecutive images are considered as source images. Further, the depth of the target image (in this case the middle image, e.g., Image 2) is estimated with both the source images as the second input, one by one. For instance, the depth of the target image (in this case the middle image, e.g., Image 2) estimated with the first source image (e.g., Image 1) is referred to as depth 1, and the depth of the target image (in this case the middle image, e.g., Image 2) estimated with the second source image (e.g., Image 3) is referred to as depth 2. Since all the estimated depths (e.g., depth 1 and depth 2) are for the target image (in this case the middle image, e.g., Image 2) itself, the estimated depths (e.g., depth 1 and depth 2) need to be consistent with one another. In an embodiment, a deviation in the values of the estimated depths (e.g., depth 1 and depth 2) is referred to as the depth consistency loss. In an embodiment, the depth consistency loss is represented by equation (5) provided below as:
L_{depth} = \frac{1}{N} \sum_{i=0}^{S} \sum_{j=i+1}^{S} \sum_{p} \left| D_i(p) - D_j(p) \right| \qquad (5)
In the above equation, $N$ denotes the total number of pixels of the target image, $p$ denotes a pixel of an image, $D_i(p)$ denotes the depth of each pixel of the target image estimated with reference to the i-th source image (e.g., Image 1), and $D_j(p)$ denotes the depth of each pixel of the target image estimated with reference to the j-th source image (e.g., Image 3). Here, the i-th source image is different from the j-th source image. In an embodiment, for maintaining the consistency of the estimated depths of the target image with reference to multiple source images, the depth consistency loss should be minimized.
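For illustration only, a minimal sketch of the depth consistency loss of equation (5) is given below, assuming the depth maps of the same target image (for example, depth 1 and depth 2 in the example above) are available as NumPy arrays.

```python
import numpy as np

def depth_consistency_loss(depth_maps):
    """Equation (5): mean absolute difference between all pairs of depth maps
    of the same target image estimated with different source images."""
    n_pixels = depth_maps[0].size
    loss = 0.0
    for i in range(len(depth_maps)):
        for j in range(i + 1, len(depth_maps)):
            loss += np.sum(np.abs(depth_maps[i] - depth_maps[j])) / n_pixels
    return loss
```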
In an embodiment, the smoothness loss from the plurality of pre-computed losses is computed after the estimation of depth for a target image from the plurality of target images of the scene with reference to a source image from the plurality of source images. In an embodiment, the smoothness loss is computed to determine one or more depth discontinuities (e.g., sudden changes in depth) due to crossing of object boundaries and to ensure a smooth change in depth values. The smoothness loss is minimized by minimizing the L1 norm of the second order spatial gradients of the estimated depth at a pixel of the target image, denoted by $\nabla^2 d(p)$, weighted by the image Laplacian at that pixel of the target image, denoted by $\nabla^2 I(p)$. In an embodiment, the smoothness loss is represented by equation (6) provided below as:
L_{smooth} = \frac{1}{N} \sum_{p=0}^{N} \sum_{i \in \{x,y\}} \sum_{j \in \{x,y\}} \left| \nabla_{ij}\, d(p) \right| \, e^{-\left| \nabla_{ij} I(p) \right|} \qquad (6)
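As an illustrative sketch only, the following example evaluates the edge-aware second order smoothness loss of equation (6) using finite differences; the use of np.gradient and the boundary handling are simplifying assumptions of this example.

```python
import numpy as np

def second_order_grads(x):
    """Second-order finite differences d_xx, d_xy, d_yx, d_yy of a 2D array."""
    dx = np.gradient(x, axis=1)
    dy = np.gradient(x, axis=0)
    return {
        "xx": np.gradient(dx, axis=1),
        "xy": np.gradient(dx, axis=0),
        "yx": np.gradient(dy, axis=1),
        "yy": np.gradient(dy, axis=0),
    }

def smoothness_loss(depth, image):
    """Equation (6): |second-order depth gradients| weighted by exp(-|image gradients|)."""
    d_grads = second_order_grads(depth)
    i_grads = second_order_grads(image)
    loss = 0.0
    for key in d_grads:
        loss += np.mean(np.abs(d_grads[key]) * np.exp(-np.abs(i_grads[key])))
    return loss
```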
In an embodiment, the structural similarity index loss (SSIM loss) from the plurality of pre-computed losses provides a metric to determine the similarity between two images of a scene (interchangeably referred to as image similarity). The structural similarity index (SSIM) considers three main factors, namely luminance, contrast and structure, to provide a measure for image similarity. Since the structural similarity index (SSIM) needs to be maximized (with 1 as the maximum value), the structural similarity index loss (SSIM loss) is minimized. In an embodiment, the structural similarity index loss (SSIM loss) is expressed mathematically by equation (7) provided below as:
L_{ssim} = \sum_{s} \frac{1 - \mathrm{SSIM}(I_t, \hat{I}_s)}{2} \qquad (7)
Here, $I_t$ denotes a target image, $\hat{I}_s$ denotes a warped image, and $L_{ssim}$ denotes the structural similarity index loss (SSIM loss).
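For illustration, a minimal sketch of the SSIM loss of equation (7) is given below; it uses a simplified global SSIM computed from per-image means, variances and covariance with the usual stabilizing constants (images assumed normalized to [0, 1]), whereas a practical implementation would typically use local windows.

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified (global) structural similarity between two images in [0, 1]."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return (((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) /
            ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))

def ssim_loss(target, warped_sources):
    """Equation (7): sum over warped source views of (1 - SSIM) / 2."""
    return sum((1.0 - ssim(target, w)) / 2.0 for w in warped_sources)
```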
In an embodiment, the loss function estimated based on the computed second photometric loss and the plurality of pre-computed losses is expressed mathematically by equation (8) provided below as:
L = \sum_{l} \left( L_{2warp}^{l} + \lambda_{smooth} L_{smooth}^{l} + \lambda_{ssim} L_{ssim}^{l} + \lambda_{depth} L_{depth}^{l} \right) \qquad (8)
Here, $L$ denotes the loss function, $L_{2warp}^{l}$ denotes the second photometric loss, $L_{smooth}^{l}$ denotes the smoothness loss, $L_{ssim}^{l}$ denotes the structural similarity index loss (SSIM loss), and $L_{depth}^{l}$ denotes the depth consistency loss, where $l$ iterates over the different scale values and $\lambda_{smooth}$, $\lambda_{ssim}$ and $\lambda_{depth}$ are relative weights for the smoothness loss, the SSIM loss and the depth consistency loss respectively.
In an embodiment, the one or more hardware processors are further configured to train the first deep neural network and the second deep neural network, based on the loss function and by using an Adam optimizer, until the number of iterations specific to the training of the first deep neural network and the second deep neural network reaches a pre-determined value. In an embodiment, the pre-determined value of the number of iterations is 20 epochs. In an embodiment, TensorFlow is used for training the first deep neural network and the second deep neural network. Further, batch normalization is used for the non-output layers, and an Adam Optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, a learning rate of 0.0002 and a mini-batch of size 4 is used for training the first deep neural network. The values of the weights are set as $\lambda_{smooth} = 0.2$, $\lambda_{ssim} = 0.7$ and $\lambda_{depth} = 0.5$. The learning typically converges after 26 epochs. In an embodiment, for training the first deep neural network and the second deep neural network, raw images from a known training dataset, namely the KITTI dataset, with the given split having about 40K images, are used. Further, static scenes and test image sequences are excluded from the training dataset, leaving about 33K of the 40K images.
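By way of a hedged illustration, the following sketch combines the per-scale losses of equation (8) with the stated weights and instantiates the described Adam settings in TensorFlow; the dictionary-based interface for the per-scale loss terms is an assumption made for brevity and does not reflect the exact training code.

```python
import tensorflow as tf

LAMBDA_SMOOTH, LAMBDA_SSIM, LAMBDA_DEPTH = 0.2, 0.7, 0.5

def total_loss(per_scale_losses):
    """Equation (8): per_scale_losses is a list of dicts with keys
    'warp', 'smooth', 'ssim' and 'depth', one dict per scale level l."""
    loss = 0.0
    for l in per_scale_losses:
        loss += (l["warp"]
                 + LAMBDA_SMOOTH * l["smooth"]
                 + LAMBDA_SSIM * l["ssim"]
                 + LAMBDA_DEPTH * l["depth"])
    return loss

# Adam optimizer with the settings stated above (beta1=0.9, beta2=0.999, lr=0.0002).
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0002,
                                     beta_1=0.9, beta_2=0.999)
# Both networks are trained jointly by applying the gradients of total_loss
# with respect to the combined set of trainable variables of the two networks.
```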
The illustrated steps of the method 200 are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development may change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation.
Experimental Results:
The performance of the proposed system and traditional systems is analyzed in terms of depth estimation results, pose estimation results, an ablative study on the effect of different losses on depth estimation results, and 3-view depth estimation results. Simulation results shown in Table 1 provide a comparison of traditional systems with the proposed system in terms of depth estimation.
System | Supervision | Abs. Rel. | Sq. Rel. | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³
(Abs. Rel., Sq. Rel., RMSE and RMSE log are error metrics, lower is better; δ < 1.25, δ < 1.25² and δ < 1.25³ are accuracy metrics, higher is better.)
Traditional System 1 | – | 0.403 | 5.53 | 8.709 | 0.403 | 0.593 | 0.776 | 0.878
Traditional System 2 (Coarse) | Depth | 0.214 | 1.605 | 6.563 | 0.292 | 0.673 | 0.884 | 0.957
Traditional System 2 (Fine) | Depth | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.89 | 0.958
Traditional System 3 | Depth | 0.202 | 1.614 | 6.523 | 0.275 | 0.678 | 0.895 | 0.965
Traditional System 4 | Stereo | 0.148 | 1.344 | 5.927 | 0.247 | 0.803 | 0.922 | 0.964
Traditional System 5 | Mono | 0.308 | 9.367 | 8.7 | 0.367 | 0.752 | 0.904 | 0.952
Traditional System 6 | Mono | 0.182 | 1.481 | 6.501 | 0.267 | 0.725 | 0.906 | 0.963
Traditional System 7 (w/o explain-ability) | Mono | 0.221 | 2.226 | 7.527 | 0.294 | 0.676 | 0.885 | 0.954
Traditional System 7 | Mono | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957
Traditional System 7 (updated from github) | Mono | 0.183 | 1.595 | 6.709 | 0.27 | 0.734 | 0.902 | 0.959
System of present disclosure | Mono | 0.175 | 1.675 | 6.378 | 0.255 | 0.76 | 0.916 | 0.966
Traditional System 8 | Stereo | 0.169 | 1.08 | 5.104 | 0.273 | 0.74 | 0.904 | 0.962
Traditional System 7 (w/o explain-ability) | Mono | 0.208 | 1.551 | 5.452 | 0.273 | 0.695 | 0.9 | 0.964
Traditional System 7 | Mono | 0.201 | 1.391 | 5.181 | 0.264 | 0.696 | 0.9 | 0.966
System of present disclosure | Mono | 0.166 | 1.213 | 4.812 | 0.239 | 0.777 | 0.928 | 0.972
Table 1
The performance of the system of the present disclosure and the traditional systems was evaluated on 697 images. For the first four rows of Table 1, the estimated depth is capped at 80 m and for the last three rows of Table 1, the estimated depth is capped at 50 m. For a given estimated depth denoted by $\hat{y}_i$ and a corresponding ground truth depth denoted by $y_i^{*}$ for the i-th image, error metrics and accuracy metrics are provided. The error metric is further measured in terms of Absolute Relative Difference (Abs. Rel.), Squared Absolute Relative Difference (Sq. Rel.), Root Mean Squared Error (RMSE), and Logarithmic RMSE (RMSE log). In an embodiment, Absolute Relative Difference (Abs. Rel.), Squared Absolute Relative Difference (Sq. Rel.), Root Mean Squared Error (RMSE), and Logarithmic RMSE (RMSE log) are expressed with the following mathematical expressions:
Absolute Relative Difference (Abs. Rel.): \frac{1}{N} \sum_{i=1}^{N} \frac{\left| \hat{y}_i - y_i^{*} \right|}{y_i^{*}}
Squared Absolute Relative Difference (Sq. Rel.): \frac{1}{N} \sum_{i=1}^{N} \frac{\left( \hat{y}_i - y_i^{*} \right)^{2}}{y_i^{*}}
Root Mean Squared Error (RMSE): \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i^{*} \right)^{2} }
Logarithmic RMSE (RMSE log): \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( \log \hat{y}_i - \log y_i^{*} \right)^{2} }
However, the accuracy metric is calculated as the percentage of images for which the value of $\delta$, expressed with the following mathematical expression:
\delta = \max\left( \frac{\hat{y}_i}{y_i^{*}}, \frac{y_i^{*}}{\hat{y}_i} \right)
is less than a threshold. In the proposed disclosure, three threshold values are selected, which are 1.25, 1.25² and 1.25³. As can be seen from Table 1, the proposed system performs better than the traditional systems which are purely image based, the traditional systems using depth supervision, and those traditional systems which use calibrated stereo supervision.
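For completeness, a minimal sketch of these evaluation metrics is given below, assuming the predicted and ground truth depths are NumPy arrays of equal shape with no invalid entries.

```python
import numpy as np

def depth_metrics(y_hat, y_star):
    """Abs. Rel., Sq. Rel., RMSE, RMSE log and threshold accuracies."""
    abs_rel = np.mean(np.abs(y_hat - y_star) / y_star)
    sq_rel = np.mean((y_hat - y_star) ** 2 / y_star)
    rmse = np.sqrt(np.mean((y_hat - y_star) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(y_hat) - np.log(y_star)) ** 2))
    delta = np.maximum(y_hat / y_star, y_star / y_hat)
    accuracy = {t: np.mean(delta < t) for t in (1.25, 1.25 ** 2, 1.25 ** 3)}
    return abs_rel, sq_rel, rmse, rmse_log, accuracy
```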
FIGS. 5A through 5C depict the performance of the system 100 of the present disclosure and traditional systems analyzed in terms of depth estimation results, in accordance with some embodiments of the present disclosure. As depicted in FIGS. 5A through 5C, three images and their ground truth images are provided. The first image as depicted in FIG. 5A, depicts a scene captured in large open spaces. The second image as depicted in FIG. 5B, depicts a scene comprising texture-less regions. The third image as depicted in FIG. 5C, depicts a scene captured when objects are present right in front of the camera. The ground truth image is interpolated from sparse measurements for visualization purposes. Further, from FIGS. 5A through 5C, it is evident that the system 100 of the present disclosure performs better and provides more meaningful depth estimates in comparison to traditional systems. This shows the effectiveness of having 2-view depth estimation and using epipolar geometry to handle occlusions and non-rigidity. The system 100 provides sharper outputs as compared to traditional systems, which can be seen in the results. The proposed system scales the depth estimations such that it matches the median of the ground truth.
Simulation results shown in Table 2 provide a comparison of traditional systems with the proposed system in terms of pose estimation.
Seq | Average Trajectory Error (m): Traditional System A | Average Trajectory Error (m): System of the present disclosure | Average Translational Direction Error (radian): Traditional System B | Average Translational Direction Error (radian): System of the present disclosure
0 | 0.5099±0.2471 | 0.4967±0.1787 | 0.0084±0.0821 | 0.0040±0.0155
1 | 1.2290±0.2518 | 1.1458±0.2175 | 0.0061±0.0807 | 0.0033±0.0077
2 | 0.6330±0.2328 | 0.6512±0.1806 | 0.0035±0.0509 | 0.0021±0.0026
3 | 0.3767±0.1527 | 0.3583±0.1254 | 0.0142±0.1611 | 0.0027±0.0042
4 | 0.4869±0.0537 | 0.6404±0.0607 | 0.0182±0.2131 | 0.0002±0.0007
5 | 0.5013±0.2564 | 0.4930±0.1974 | 0.0130±0.0945 | 0.0044±0.0044
6 | 0.5027±0.2605 | 0.5384±0.1627 | 0.0130±0.1591 | 0.0080±0.0688
7 | 0.4337±0.3254 | 0.4032±0.2380 | 0.0508±0.2453 | 0.0114±0.0430
8 | 0.4824±0.2396 | 0.4708±0.1827 | 0.0091±0.0646 | 0.0037±0.0058
9 | 0.6652±0.2863 | 0.6280±0.2028 | 0.0204±0.1722 | 0.0073±0.0211
10 | 0.4672±0.2398 | 0.4185±0.1791 | 0.0200±0.1241 | 0.0040±0.0105
Table 2
In an embodiment, for the pose estimation experiments, the system 100 utilizes the known KITTI Visual Odometry Benchmark dataset. The results of the system 100 are shown for only 11 sequences (00-10), which are identified to have the associated ground truth. A sequence of 3 images is used as the input to the second deep neural network, with the middle view as the target view. Each image is of size 1271×376, which is scaled down to a size of 416×128 for both the pose estimation and depth estimation experiments. The pose estimation results are provided in the form of Average Trajectory Error (ATE) and Average Translational Direction Error (ATDE) averaged over 3 frame intervals. As can be seen from Table 2, it is observed that the system 100 performs better than Traditional System A on average in terms of the ATE, showing that adding meaningful geometric constraints helps to obtain better estimates as compared to minimizing just the reprojection error. The proposed system 100 performs better on all runs compared to the relative poses obtained from Traditional System B in terms of the ATDE. This arises from the fact that the proposed system has the additional constraints of depth and image warping, which help in giving a better estimation of the direction of motion, whereas Traditional System B uses only sparse point correspondences between the images. Moreover, Traditional System B in itself is slightly erroneous due to inaccuracies arising in the feature matching or in the Random Sample Consensus (RANSAC) based estimation of the essential matrix. Despite being given slightly erroneous estimates of the essential matrix, incorporating image reconstruction as the main goal helps in overcoming the erroneous estimations.
In an embodiment, an ablation study is performed to analyze the effect of the different losses on depth estimation by considering a plurality of variants of the method of the present disclosure. The plurality of variants includes the method of the present disclosure (no epi), the method of the present disclosure (single view), the method of the present disclosure (1st order), and the method of the present disclosure (final). In an embodiment, the method of the present disclosure (no epi) variant is obtained by modifying the method of the present disclosure to remove the epipolar loss. Further, the method of the present disclosure (1st order) variant is obtained by replacing the second order edge-aware smoothness with a first order edge-based smoothness. In a similar way, the method of the present disclosure (single view) variant is obtained by training the first deep neural network (alternatively referred to as the depth network) with single-view depth having epipolar constraints. The results of studying the effects of the epipolar constraints, the 2-view depth estimation, and the second order edge-based smoothness are shown in Table 3.
Error metrics (Abs. Rel., Sq. Rel., RMSE, RMSE log): lower is better. Accuracy metrics (d < 1.25, d < 1.25², d < 1.25³): higher is better.

| Method | Abs. Rel. | Sq. Rel. | RMSE | RMSE log | d < 1.25 | d < 1.25² | d < 1.25³ |
|---|---|---|---|---|---|---|---|
| Method of traditional System 7 (w/o explainability) | 0.221 | 2.226 | 7.527 | 0.294 | 0.676 | 0.885 | 0.954 |
| Method of the present disclosure (no-epi) | 0.199 | 1.548 | 6.314 | 0.274 | 0.697 | 0.901 | 0.964 |
| Method of the present disclosure (single-view) | 0.181 | 1.52 | 6.08 | 0.26 | 0.747 | 0.914 | 0.965 |
| Method of the present disclosure (1st order) | 0.19 | 1.44 | 6.144 | 0.269 | 0.714 | 0.906 | 0.965 |
| Method of the present disclosure (final) | 0.175 | 1.675 | 6.378 | 0.255 | 0.761 | 0.916 | 0.966 |

Table 3
As can be seen from Table 3, the method of traditional system 7 without the explainability mask is considered as the baseline. All the variants of the method of the present disclosure perform significantly better than this baseline, showing that the method of the present disclosure has a positive effect on the learning. Further, the method of the present disclosure using 2-view depth estimation without epipolar constraints already gives a significant improvement, showing that multiple views provide better depth estimates than a single view. In an embodiment, incorporating only the epipolar loss with a single view, as in the method of the present disclosure (single-view) variant, improves the output drastically, showing that geometrically meaningful constraints provide better depth outputs. In an embodiment, the combination of any two approaches depicted in Table 3 does not lead to a drastic improvement over the individual methods. However, providing a second order smoothness improves the results compared to a first order smoothness, as illustrated in the sketch below. A first order smoothness implies a constant depth change, which is not necessarily true: for parts of the image that are closer to the camera the depth varies slowly, whereas for parts that are farther away the depth variations are larger. Therefore, a second order smoothness captures this change in depth variation rather than just the change in depth.
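The difference between the first order and second order variants can be made concrete with a short sketch. The PyTorch-style edge-aware formulation below is a common way of writing such losses and is given only as an illustrative assumption, not as the exact loss of the disclosure.

```python
import torch

def edge_aware_smoothness(disp, img, second_order=True):
    """Edge-aware smoothness on a disparity/depth map `disp` (B,1,H,W),
    weighted by image gradients of `img` (B,3,H,W).

    second_order=False penalises the first spatial derivative of disp
    (favouring constant depth); second_order=True penalises the second
    derivative (favouring constant depth *change*), which tolerates
    smooth depth ramps such as roads receding from the camera.
    """
    def grad_x(t): return t[:, :, :, :-1] - t[:, :, :, 1:]
    def grad_y(t): return t[:, :, :-1, :] - t[:, :, 1:, :]

    dx, dy = grad_x(disp), grad_y(disp)
    if second_order:
        dx, dy = grad_x(dx), grad_y(dy)

    # Down-weight the smoothness penalty across strong image edges.
    wx = torch.exp(-grad_x(img).abs().mean(1, keepdim=True))
    wy = torch.exp(-grad_y(img).abs().mean(1, keepdim=True))
    if second_order:
        wx, wy = wx[:, :, :, :-1], wy[:, :, :-1, :]

    return (dx.abs() * wx).mean() + (dy.abs() * wy).mean()
```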
In an embodiment, a 3-view depth estimation approach is performed and compared with the 2-view depth estimation approach proposed in the present disclosure. Although the 3-view depth estimation gives visually understandable results, the 2-view depth estimation provides better depth estimates. Further, it is observed that during turning, the depth outputs deteriorate to a much larger degree with the 3-view depth estimation approach. In an embodiment, in comparison to the motion between 2 views, 3 views exhibit a larger amount of motion, which results in a smaller amount of overlap between images. Though the motion is not too large, it is large enough to escape the receptive field of a convolutional filter at the input, since the views are stacked together and given to the deep neural networks. Thus, the present system, with its 2-view depth estimation approach, performs better than the 3-view depth estimation approach.
FIG. 6A depicts the 3-view depth estimation results, in accordance with some embodiments of the present disclosure. As depicted in FIG. 6A, though the coarse structure of the scene is detected, objects are still not finely reconstructed. The third image is an example where the car is taking a turn, which yields haphazard depth outputs. In an embodiment, in order to test the hypothesis of the system proposed in the present disclosure regarding filter sizes, 3-view depth estimation is performed with increased convolutional filter sizes. More specifically, each of the convolutional filter sizes is increased by a value of 4. In an embodiment, the increased convolutional filter size increases the receptive field of a filter, thus allowing it to accumulate information from a larger area of pixels. This way, the filter is able to properly "see" a pixel across multiple views, where the pixel would otherwise fall outside the filter's field of view. As depicted in FIG. 6B, 3-view depth estimation with larger convolutional filter sizes leads to smoother depth images compared to 3-view depth estimation with smaller filter sizes. However, the 2-view depth estimation approach proposed in the present disclosure still performs better than 3-view depth estimation with larger convolutional filter sizes. Also, the 3-view depth estimation with larger convolutional filter sizes develops a few unwanted holes.
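To illustrate the filter-size hypothesis, the sketch below shows how the first convolution of a depth encoder might be widened when views are stacked along the channel dimension. The base kernel size, channel counts, and helper name are assumptions made for illustration and do not describe the disclosed architecture; only the increment of 4 mirrors the experiment above.

```python
import torch
import torch.nn as nn

def first_conv(num_views, widen_by=0):
    """First layer of a depth encoder taking `num_views` RGB images stacked
    along the channel axis. `widen_by` enlarges the kernel (e.g. 7 -> 11),
    increasing the receptive field so that a pixel that moves further
    between views still falls inside a single filter's window.
    """
    k = 7 + widen_by   # base kernel size of 7 is an assumption
    return nn.Conv2d(in_channels=3 * num_views, out_channels=32,
                     kernel_size=k, stride=2, padding=k // 2)

# 2-view input (target + source) with the default kernel:
conv2 = first_conv(num_views=2)
# 3-view input with each kernel dimension enlarged by 4, as in the experiment:
conv3 = first_conv(num_views=3, widen_by=4)
out = conv3(torch.randn(1, 9, 128, 416))   # 416 x 128 input, as in the KITTI experiments
```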
Hence the present disclosure provides an unsupervised method for learning deep visual odometry and depth by leveraging the fact that depth estimation can be made more robust by using multiple views rather than a single view. Along with this, the present disclosure incorporates epipolar constraints to make the learning more geometrically meaningful while using fewer trainable parameters. The proposed system of the present disclosure estimates depth with higher accuracy, along with giving sharper depth estimates and better pose estimates. Although increasing the number of input views for depth estimation gives a good output with 2 views, the 3-view counterpart of the proposed system does not perform as well. This remains an interesting problem for future improvement, either through architectural changes in the depth network or by incorporating a post-processing optimization on top of the deep neural networks. In an embodiment, the second deep neural network (utilized in the proposed system of the present disclosure) is modified by removing the "explainability mask" from the pose network proposed by the existing traditional system. Thus, the proposed method provides better performance while learning fewer parameters. In an embodiment, the "explainability mask" refers to a motion mask used to discount regions undergoing motion, occluded areas, or other factors that could affect the learning. In other words, fewer trainable parameters are used in the proposed system by removing the need to predict motion masks. The proposed system performs only pixel-level inferences; there is scope for obtaining a higher scene-level understanding by integrating semantics of the scene to obtain a better correlation between objects in the scene and the depth and ego-motion estimates. Further, architectural changes could also be leveraged to obtain a stronger coupling between depth and pose, by having a single deep neural network estimate both pose and depth so that the network can learn representations that capture the complex relation between camera motion and scene depth. In an embodiment, the pose network utilized in the proposed system is a modification of a pose network proposed by an existing traditional system.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.