
Method And System For Integrating Objects Into Monocular Slam Using Line Based Parameterization

Abstract: The integration of objects into monocular SLAM is a well-researched topic, and existing SLAM systems rely on pose graph / factor graph optimization. A method and system for integrating objects into monocular SLAM using line based parameterization is provided. The disclosure describes a line parameterization for category specific models and integrates objects of a 3D scene into monocular SLAM. The input image containing objects is given to Render for CNN for initializing the viewpoints of the objects. Further, the image is presented to a YOLO detector for detecting bounding boxes on the objects of interest. Thereafter, line segments within the YOLO bounding boxes of the objects of interest are estimated, and a shape-pose optimization using the CERES optimizer is performed. This output is used to estimate the SLAM trajectory and the object locations for different monocular sequences.


Patent Information

Application #: 201821047937
Filing Date: 18 December 2018
Publication Number: 25/2020
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email: kcopatents@khaitanco.com
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2023-11-08
Renewal Date:

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point, Mumbai 400021, Maharashtra, India

Inventors

1. BHOWMICK, Brojeshwar
Tata Consultancy Services Limited, Building 1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas,Kolkata 700160, West Bengal
2. KHAWAD, Rishabh
Tata Consultancy Services Limited, Building 1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata 700160, West Bengal
3. JOSHI, Nayan
International Institute of Information Technology, (IIIT-H), Gachibowli, Near Gachibowli Stadium, Hyderabad 500032, Telangana
4. SHARMA, Yogesh
International Institute of Information Technology, (IIIT-H), Gachibowli, Near Gachibowli Stadium, Hyderabad 500032, Telangana
5. KRISHNA, Madhava
International Institute of Information Technology, (IIIT-H), Gachibowli, Near Gachibowli Stadium, Hyderabad 500032, Telangana
6. PARKHIYA, Parv
International Institute of Information Technology (IIIT-H), Gachibowli, Near Gachibowli Stadium, Hyderabad 500032, Telangana

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR INTEGRATING OBJECTS INTO MONOCULAR SLAM USING LINE BASED PARAMETERIZATION
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD
[001] The embodiments herein generally relate to the field of Simultaneous Localization and Mapping (SLAM). More particularly, but not specifically, the invention provides a system and method for integrating objects into monocular SLAM using line based parameterization.
BACKGROUND
[002] In recent years, with the further development of computer technology, digital image processing technology and image processing hardware, computer vision has also gained attention in the field of robotics. Simultaneous Localization and Mapping (SLAM) is the most vital cog in various mobile robotic applications involving ground robots, aerial vehicles and underwater vehicles. Monocular SLAM has emerged as a popular choice given its light weight and easy portability, especially in payload-restricted systems such as micro aerial vehicles (MAV) and hand-held camera platforms. Real-time monocular SLAM has increasingly become a popular research topic.
[003] SLAM has evolved in various flavors such as active SLAM, wherein planning is interleaved with SLAM, dynamic SLAM, which reconstructs moving objects, and robust SLAM. Object SLAM is a relatively new paradigm wherein SLAM information is augmented with objects in the form of their poses to achieve more semantically meaningful maps, with the eventual objective of improving the accuracy of SLAM systems. Object SLAM presents itself in two popular threads. In the first, instance specific models are assumed to be known a priori. In the second, a general model for an object is used, such as ellipsoids and cuboids. Relying on instance level models for the various objects in the scene makes the first theme difficult to scale, whereas general models such as cuboids do not provide meaningful information at the level of object parts and limit their relevance in applications that require grasping and handling objects.

[004] To overcome such limitations, a research group positioned their research as one that combines the benefits of both. In particular, category specific models were developed in lieu of instance level models, which retained the semantic potential of the former along with the generic nature of the latter at the level of the object category. However, its reliance on a keypoint network trained for a particular category limits its expressive power, as every new object category entails the estimation of a new network model for that category, along with the concomitant issues of annotation, GPU requirements and dataset preparation.
SUMMARY
[005] The following presents a simplified summary of some embodiments of the disclosure in order to provide a basic understanding of the embodiments. This summary is not an extensive overview of the embodiments. It is not intended to identify key/critical elements of the embodiments or to delineate the scope of the embodiments. Its sole purpose is to present some embodiments in a simplified form as a prelude to the more detailed description that is presented below. In view of the foregoing, an embodiment herein provides a system and method for integrating one or more objects into monocular simultaneous localization and mapping (SLAM).
[006] A system for integrating one or more objects into monocular simultaneous localization and mapping (SLAM) is provided. The system comprises an input module, a memory and a processor in communication with the memory. The input module provides an image as an input image captured using a monocular camera. The processor further comprises a viewpoint initialization module, a bounding box detection module, an edge detection module, a characterization module, a correspondence identification module, an optimization module and an integration module. The viewpoint initialization module initializes the viewpoints of the one or more objects present in the input image using a computational neural network technique. The bounding box detection module detects bounding boxes of the one or more objects in the input image using a real time object detection technique. The edge detection module detects one or more edges using an LSD edge detector and the detected bounding boxes of the one or more objects. The characterization module characterizes the one or more objects as a set of 3D lines frame using the detected one or more edges and the initialized viewpoints. The correspondence identification module identifies the edge correspondence of the set of 3D lines frame corresponding to a set of 2D lines frame. The optimization module optimizes the shape and pose of the set of 2D lines frame of the one or more objects, wherein the optimization results in the generation of the one or more objects. The integration module integrates the generated one or more objects into monocular simultaneous localization and mapping (SLAM) to get the trajectory of the movement of the monocular camera.
[007] In another aspect, the embodiment herein provides a method for integrating one or more objects into monocular simultaneous localization and mapping (SLAM). Initially, an image is provided as an input image captured using a monocular camera. Further, the viewpoints of the one or more objects present in the input image are initialized using a computational neural network technique. In the next step, bounding boxes of the one or more objects are detected in the input image using a real time object detection technique. Later, the one or more edges are detected using an LSD edge detector and the detected bounding boxes of the one or more objects. In the next step, the one or more objects are characterized as a set of 3D lines frame using the detected one or more edges and the initialized viewpoints. The edge correspondence of the set of 3D lines frame corresponding to a set of 2D lines frame is then identified. In the next step, the shape and pose of the set of 2D lines frame of the one or more objects are optimized, wherein the optimization results in the generation of the one or more objects. Finally, the generated one or more objects are integrated into monocular simultaneous localization and mapping (SLAM) to get the trajectory of the movement of the monocular camera.
[008] It should be appreciated by those skilled in the art that any block diagram herein represents conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in a computer readable medium and so executed by a computing device or processor, whether or not such computing device or processor is explicitly shown.
BRIEF DESCRIPTION OF THE DRAWINGS
[009] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.
[010] Fig. 1 illustrates a block diagram of a system for integrating one or more objects into monocular simultaneous localization and mapping (SLAM) according to an embodiment of the present disclosure;
[011] Fig. 2 shows an architectural diagram of the system for integrating one or more objects into monocular simultaneous localization and mapping (SLAM) according to an embodiment of the disclosure;
[012] Fig. 3 shows the projection of 3D line to 2D to calculate association cost with 2D line x1, x2 according to an embodiment of the disclosure;
[013] Fig. 4 shows the perspective projection of image lines according to an embodiment of the disclosure; and
[014] Fig. 5A-5B is a flowchart illustrating the steps involved in integrating one or more objects into monocular simultaneous localization and mapping (SLAM) according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
[015] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
[016] Referring now to the drawings, and more particularly to Fig. 1 through Fig. 5, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[017] According to an embodiment of the disclosure, a system 100 for integrating one or more objects into monocular simultaneous localization and mapping (SLAM) is shown in the block diagram of Fig. 1. The system 100 uses a line based parameterization for category specific CAD models and thereby integrates objects into monocular SLAM. The disclosed system 100 and method are also configured to estimate the trajectory of SLAM and the object locations for different monocular sequences. Fig. 2 shows the architectural workflow of the system 100 for integrating objects into monocular SLAM using line based parameterization.
[018] According to an embodiment of the disclosure, the system 100 further comprises an input module 102, a memory 104 and a processor 106 as shown in the block diagram of Fig. 1. The processor 106 works in communication with the memory 104. The processor 106 further comprises a plurality of modules. The plurality of modules accesses the set of algorithms stored in the memory 104 to perform certain functions. The processor 106 further comprises a viewpoint initialization module 108, a bounding box detection module 110, an edge detection module 112, a characterization module 114, a correspondence identification module 116, an optimization module 118 and an integration module 120.
[019] According to an embodiment of the disclosure, the input module 102 is configured to provide an image as an input image to the system 100. The input image is captured using a monocular camera. The monocular camera is normally fitted onto a moving device where navigation is required, such as MAVs, drones, ground robots, underwater surveillance vehicles etc. The moving device normally moves in an unknown environment. The environment may include one or more objects in the vicinity. Thus, the input image captured by the monocular camera may contain one or more objects. The input module 102 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like, and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite.
[020] According to an embodiment of the disclosure, the processor 106 comprises the viewpoint initialization module 108. The viewpoint initialization module 108 is configured to initialize the viewpoints of the one or more objects present in the input image using a computational neural network technique. The computational neural network technique is trained using a training data set. The Render for CNN pipeline is trained for category specific viewpoint estimation of the objects.
[021] In the present embodiment, the emphasis has been laid on the use of category-level models as opposed to instance-level models for objects. To construct a line based category level model, each object is first characterized as a set of 3D lines that are common across all instances of the category. Any line based model is represented by a vector X of dimension 6*m, where m is the number of lines Li present in the parameterized model, each Li corresponding to a key edge of the model representing the object. Each of these m lines is represented by a 3D direction D and a 3D point M that lies on the line. While the 3D point can be any point lying on the line, it is roughly chosen to be the midpoint of the corresponding edge of the 3D CAD models.
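By way of illustration, such a line based model might be stored as in the following minimal sketch, which follows the 6*m parameterization described above; the class and field names are illustrative and not part of the disclosure.

import numpy as np

class LineModel:
    """Category-level wireframe: m lines, each a 3D midpoint M and 3D direction D.

    Stored as a single vector X of length 6*m, i.e. [M1, D1, M2, D2, ...],
    matching the 6m-dimensional parameterization described above.
    (Illustrative sketch; names are not from the specification.)
    """
    def __init__(self, midpoints, directions):
        # midpoints, directions: arrays of shape (m, 3)
        self.M = np.asarray(midpoints, dtype=float)
        self.D = np.asarray(directions, dtype=float)
        # normalize directions so D encodes orientation only
        self.D /= np.linalg.norm(self.D, axis=1, keepdims=True)

    def as_vector(self):
        # flatten to the 6m-dimensional shape vector X
        return np.hstack([self.M, self.D]).reshape(-1)

    @classmethod
    def from_vector(cls, X):
        lines = np.asarray(X, dtype=float).reshape(-1, 6)
        return cls(lines[:, :3], lines[:, 3:])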

[022] If no prior information about the object is known, then the search space is a prohibitive 6*m dimensional space representing the shape of the object. But based on the 3D annotation of the CAD models, the search space can be reduced so that, while optimizing for shape, only the deformations possible for that object category are considered, rather than arbitrary line deformations. A simple principal component analysis is performed on the annotated CAD model dataset to get the top seven linearly independent principal directions of deformation. These eigenvectors are sorted based on their eigenvalues. The number seven is chosen based on the coverage of the eigenvectors.
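For illustration, the deformation basis described above can be obtained with a standard PCA over the annotated CAD vectors. The sketch below assumes each CAD model has already been converted into its 6m-dimensional line vector; function and variable names are illustrative.

import numpy as np

def deformation_basis(cad_vectors, num_basis=7):
    """PCA over aligned, annotated CAD model vectors (one 6m vector per model).

    Returns the mean shape, the top `num_basis` deformation directions sorted
    by eigenvalue, and the corresponding variances. Illustrative sketch only.
    """
    X = np.asarray(cad_vectors, dtype=float)     # shape (N_models, 6m)
    mean_shape = X.mean(axis=0)
    centered = X - mean_shape
    # SVD of the centered data: rows of Vt are the principal directions,
    # singular values are already sorted in descending order.
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    variances = s ** 2 / (len(X) - 1)
    return mean_shape, Vt[:num_basis], variances[:num_basis]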
[023] While solving for a shape, an object is represented by the mean shape plus a weighted linear combination of the deformation directions. In such a shape representation, each object can be represented by those weights (or shape parameters), one for each principal deformation direction. This linear subspace model has a much lower dimension than R^(6m). This is easy to see, since various planarity and symmetry conditions are present in the objects.
[024] Mathematically, if X̄ is the mean shape of the category, and the Vi are a deformation basis obtained from PCA over a collection of aligned, ordered 3D CAD models as explained in this section, any object X with shape parameters λi can be represented as

X = X̄ + Σ_{i=1..B} λi Vi

where B is the number of basis vectors (the top-B eigenvectors after PCA) and Λ is the vector consisting of all λi.
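As a simple illustration of this shape model, the reconstruction step can be written as follows; the function name and argument layout are illustrative, not from the specification.

import numpy as np

def shape_from_params(mean_shape, basis, lambdas):
    """Reconstruct an object wireframe X = X_bar + sum_i lambda_i * V_i.

    mean_shape : (6m,) mean shape vector
    basis      : (B, 6m) deformation basis from PCA
    lambdas    : (B,) shape parameters
    """
    return np.asarray(mean_shape) + np.asarray(lambdas) @ np.asarray(basis)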
[025] According to an embodiment of the disclosure, the system 100 further comprises the bounding box detection module 110. The bounding box detection module 110 is configured to detect bounding boxes of the one or more objects in the input image using a real time object detection technique. The real time object detection technique is performed using the YOLO detector. The YOLO detector regresses bounding boxes on the objects of interest.
[026] According to an embodiment of the disclosure, the system 100 further comprises the edge detection module 112. The edge detection module 112 is configured to detect one or more edges using an LSD edge detector and the detected bounding boxes of the one or more objects. The LSD detector outputs the line segments within the YOLO bounding boxes.
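For illustration, the filtering of LSD segments by a detected bounding box could look like the sketch below. It assumes OpenCV's cv2.createLineSegmentDetector, which is not available in every OpenCV build, and a bounding box in (x, y, w, h) form as produced by the object detector; it is not the exact implementation used in the disclosure.

import cv2
import numpy as np

def segments_in_box(gray_image, box):
    """Detect LSD line segments and keep those lying inside one bounding box."""
    lsd = cv2.createLineSegmentDetector()
    segments, _, _, _ = lsd.detect(gray_image)      # shape (N, 1, 4): x1, y1, x2, y2
    if segments is None:
        return np.empty((0, 4))
    segments = segments.reshape(-1, 4)
    x, y, w, h = box
    inside = (
        (segments[:, [0, 2]] >= x).all(axis=1) & (segments[:, [0, 2]] <= x + w).all(axis=1) &
        (segments[:, [1, 3]] >= y).all(axis=1) & (segments[:, [1, 3]] <= y + h).all(axis=1)
    )
    return segments[inside]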
[027] According to an embodiment of the disclosure, the system 100 comprises the characterization module 114. The characterization module 114 is configured to characterize the one or more objects as a set of 3D lines frame using the detected one or more edges and the initialized viewpoints.
[028] According to an embodiment of the disclosure, the system 100 also comprises the correspondence identification module 116 (or the edge correspondence identification module 116). The correspondence identification module 116 is configured to identify the edge correspondence of the set of 3D lines frame corresponding to a set of 2D lines frame. The system 100 uses Render for CNN to estimate the viewpoint of the one or more objects in the image. Render for CNN has been trained on large, category specific datasets for several objects, rendered using easily accessible 3D CAD models. Models trained for the task of object viewpoint prediction on the rendered dataset work very well when they are fine-tuned on a large dataset comprising real images. A good pose estimate is needed to find the correspondence between the 3D CAD model and the input image, and a good association is needed to estimate the pose of the object. As explained earlier, the approximate viewpoint of the object is obtained using the Render for CNN viewpoint network, and a method is introduced to compute the approximate translation of the object. This viewpoint and translation are used as initialization for a dictionary based RANSAC method to obtain the most suitable edge correspondences.
[029] The parameterization as discussed earlier allows for the representation of CAD models in terms of a set of vectors, where each vector represents a line. Formally, a correspondence map Z is found from the n 3D lines to the m 2D line segments. First, the line segments in the image are filtered using the bounding box data obtained from the object detector. A custom cost function is used to score a 3D-2D correspondence, as shown in equation (6):
C = C1 + k1*C2 + k2*C3 .......... (6)
where C1 accounts for the angle and C2 and C3 account for the distance between the line and the line segment.
[030] Further, the system 100 is also configured to compute the translation and the aforementioned costs. For computing the translation, apart from a viewpoint initialization, an approximate value of the translation (Tx, Ty, Tz) is also needed for projection. Getting the exact translation requires the 3D length and the projected 2D length of a line segment, but since the exact 3D information of the object is not known, an approximation from the mean 3D model of that particular object category is used. The information available from the bounding box and the mean 3D model is used to find the approximate translation. The height and the width of the bounding box are each independently sufficient to get a good estimate of Tz, provided that the object's height and width respectively match those of the mean 3D model. In order to get an even better estimate in the general case, where both the height and the width of the object could deviate from the mean model, a simple average of the two estimates is taken.

where fx, fy, u and v are taken from the camera intrinsic matrix, h and w are the height and width of the bounding box, x and y are the coordinates of the top-left corner of the bounding box, and kx and ky are constants obtained from the mean 3D model.
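Since the translation equations themselves are not reproduced above, the sketch below only illustrates the written description: Tz is estimated independently from the box height and width using the mean-model constants and the two estimates are averaged, while Tx and Ty follow from the box centre under a pinhole model. Treat this as an assumption for illustration, not the exact formulation.

def approx_translation(box, fx, fy, u, v, kx, ky):
    """Rough object translation from the bounding box and the mean 3D model.

    Assumed form (illustrative): Tz from width- and height-based estimates,
    averaged; Tx, Ty by back-projecting the box centre with the pinhole model.
    """
    x, y, w, h = box
    tz = 0.5 * (fx * kx / w + fy * ky / h)   # average of the two Tz estimates
    tx = (x + w / 2.0 - u) * tz / fx         # back-project the box centre
    ty = (y + h / 2.0 - v) * tz / fy
    return tx, ty, tz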
[031] Further, C1, C2 and C3 are computed from equations (12), (13) and (14).

The projection of the 3D edge onto the image plane can be found by projecting any two points from the 3D line and then taking their direction and midpoint.


here, α is some non-zero number used to obtain two points on the line from the single point M and the direction D, and π is the projection function. In the figure, x1 and x2 are the end points of an edge detected by LSD (for a line segment to be picked as an associated line, it has to be very close to the projected 3D line; the illustration is exaggerated for representation purposes), and Ip1 and Ip2 are the projections of two points from the 3D line. p1 and p2 are the perpendicular distances of x1 and x2 from the projected line, obtained using simple projective geometry.
[032] Adding the angle directly to the cost function would create the complication of adding distances to angles, so instead it is observed that the value |p2 - p1| captures the variation of the angle between the two lines; this is used as C1. C2 captures the perpendicular distance of the midpoint of the detected line segment from the projected line. Lastly, the distance between Mp, the projected midpoint, and the midpoint of the detected segment is minimized to pick the lines radially closer to the projected line; this is used as C3. Using equations (12), (13) and (14), the final association cost of equation (6) is obtained.

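For illustration, the association cost of equation (6) can be computed as in the following sketch, which follows the description of C1, C2 and C3 above; the weights k1 and k2 are placeholders, since their values are not given in the text.

import numpy as np

def association_cost(proj_p1, proj_p2, x1, x2, k1=1.0, k2=1.0):
    """Association cost between a projected 3D model line and a 2D LSD segment.

    proj_p1, proj_p2 : two projected points of the 3D line (pixels)
    x1, x2           : end points of the detected 2D segment (pixels)
    Implements C = C1 + k1*C2 + k2*C3 as described above (sketch only).
    """
    proj_p1, proj_p2, x1, x2 = map(lambda p: np.asarray(p, float), (proj_p1, proj_p2, x1, x2))
    d = proj_p2 - proj_p1
    d /= np.linalg.norm(d)
    normal = np.array([-d[1], d[0]])                 # unit normal of the projected line

    def perp_dist(p):
        return abs(normal @ (p - proj_p1))

    p1, p2 = perp_dist(x1), perp_dist(x2)
    seg_mid = 0.5 * (x1 + x2)
    proj_mid = 0.5 * (proj_p1 + proj_p2)
    c1 = abs(p2 - p1)                                # proxy for the angle between the lines
    c2 = perp_dist(seg_mid)                          # perpendicular distance of the segment midpoint
    c3 = np.linalg.norm(seg_mid - proj_mid)          # radial closeness to the projected midpoint
    return c1 + k1 * c2 + k2 * c3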
[033] According to an embodiment of the disclosure, a dictionary of the 3-5 most representative CAD models is also generated for the category. This can be done by taking the means of the clusters formed by k-means clustering of the CAD model dataset. In addition, viewpoints are sampled around the azimuth initialization and translations around the computed T for RANSAC. Fig. 4 shows the projection of a 3D line to a 2D line to calculate the association cost.
[034] The pseudocode for the RANSAC based association algorithm iterates over the dictionary models and the sampled viewpoints, projects them, and calculates the associated lines and the cost of association. The association for a line in the model, given a viewpoint and translation, is the line segment in the image which has the minimum cost corresponding to that line in the model. Finally, the association pertaining to the lowest association cost is picked. The algorithm is shown below:

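Since the pseudocode figure is not reproduced here, the following sketch illustrates the RANSAC-style association loop as described above. The helper functions project_line and association_cost are assumed to be available, and all names are illustrative.

def ransac_association(dictionary_models, viewpoints, translations,
                       segments, project_line, association_cost):
    """Iterate over dictionary models, sampled viewpoints and translations;
    for each model line pick the minimum-cost image segment, and keep the
    hypothesis with the lowest total association cost (illustrative sketch).
    """
    best = None
    for model in dictionary_models:          # 3-5 representative CAD models
        for R in viewpoints:                 # samples around the azimuth initialization
            for T in translations:           # samples around the computed translation
                total, assoc = 0.0, []
                for line in model:
                    ip1, ip2 = project_line(line, R, T)
                    costs = [association_cost(ip1, ip2, s[:2], s[2:]) for s in segments]
                    j = min(range(len(costs)), key=costs.__getitem__)
                    assoc.append(j)
                    total += costs[j]
                if best is None or total < best[0]:
                    best = (total, assoc, R, T, model)
    return best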
[035] According to an embodiment of the disclosure, the system 100 further comprises the optimization module 118. The optimization module 118 is configured to optimize the shape and pose of the set of 2D lines frame of the one or more objects, wherein the optimization results in the generation of the one or more objects. Once the association information is known, an optimization problem is formulated to find the pose and shape of the object. The Ceres toolbox is used for this purpose. The different constraints used in the formulation, which together make up the final cost function, are described below.

[036] First, the pose constraints: Fig. 3 shows a 3D line AB projected onto the image plane, forming the 2D line ab. The normal n constructed by the cross product of oa and ob is perpendicular to the 3D line. Let M be a point on the line and D be its direction; then, with the object pose (R, T), this perpendicularity gives
n · (R*M + T) = 0
and taking the difference between two points M1 and M2 from the same line eliminates T and constrains the direction:
n · (R*D) = 0
So the cost function penalizes violations of these two constraints over all associated lines; R and T are the parameters to be optimized.
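For illustration, the two pose constraints can be expressed as residuals and minimized with a generic least-squares solver. The disclosure uses the Ceres toolbox; SciPy is used below only as a stand-in to show the constraint structure, and the parameterization (axis-angle plus translation) is an assumption.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def pose_residuals(params, normals, midpoints, directions):
    """Residuals of the line-pose constraints n.(R*M + T) and n.(R*D).

    params = [rx, ry, rz, tx, ty, tz]: axis-angle rotation and translation.
    """
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    T = params[3:]
    res = []
    for n, M, D in zip(normals, midpoints, directions):
        res.append(n @ (R @ M + T))   # point-on-line constraint
        res.append(n @ (R @ D))       # direction constraint
    return np.array(res)

# usage: result = least_squares(pose_residuals, x0, args=(normals, mids, dirs))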
[037] Second is the normal constraint. Each category of object has a base, e.g., the base of a chair for sitting. The base of the object is defined as the plane which is parallel to the ground plane when the object is kept in its normal position. This observation is used as a constraint that forces the base of the object to be parallel to the ground. The normal of the ground plane is taken to be the y-axis, so for two points M1 and M2 belonging to adjacent base lines of X, the constraint requires the y-axis to be perpendicular to the rotated direction R*(M1 - M2).
[038] Third is the shape constraint. The eigenvector formulation discussed earlier is used to optimize for the shape of the object: the point M and direction D of each line are expressed as the mean shape plus a weighted combination of the deformation basis, and substituting these expressions into equation (24) gives the shape constraint.

[039] Finally, the pose and shape are optimized. The optimizer is first called for the pose, R and T, of the object with the pose cost, followed by a call to the optimizer for the shape parameters λ of the object with the shape cost, which includes a regularizer that prevents the shape parameters from deviating far from the category model. An improvement in shape can result in an improvement in the pose of the object and vice versa; thus, both optimizations are called iteratively to achieve better results.
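The alternating scheme described above can be sketched as a simple loop; optimize_pose and optimize_shape are assumed wrappers around the respective cost minimizations (illustrative names, not from the specification).

def alternate_pose_shape(optimize_pose, optimize_shape, pose0, lambdas0, iterations=5):
    """Alternate between pose and shape optimization, each with the other fixed."""
    pose, lambdas = pose0, lambdas0
    for _ in range(iterations):
        pose = optimize_pose(pose, lambdas)       # shape held fixed
        lambdas = optimize_shape(pose, lambdas)   # pose held fixed
    return pose, lambdas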
[040] According to an embodiment of the disclosure, the system 100 comprises the integration module 120. The integration module 120 is configured to integrate the generated one or more objects into monocular simultaneous localization and mapping (SLAM) to get the trajectory of the movement of the monocular camera.
[041] Once the objects are identified after the Ceres optimization, they are used as object landmarks. The object landmarks are used to augment a 6-degree-of-freedom pose graph built from the image sequence captured by the monocular camera by performing visual odometry, as shown in Fig. 2. The pose graph is used to estimate the trajectory of the movement of the monocular camera using the loop closure detection method. Similarly, the pose graph is also used to retrieve the 3D object models overlaid on the estimated trajectory using pose graph optimization (GTSAM).
[042] The category-models learned using the line based approach are incorporated into a monocular SLAM back end. Here, Tij represents the rigid-body transform of a 3D point in the camera frame at time i with respect to the camera frame at time j; Tij is a 4x4 homogeneous transformation matrix. If the 3D coordinate of a world point with respect to frame i is Xi, then using the transformation Tij it can be represented with respect to camera frame j as Xj = Tij*Xi.
[043] For a given set of relative pose measurements of the robot across all the frames, the pose-SLAM problem is defined as estimating the robot poses that maximize the log-likelihood of the relative pose measurements, which can be framed as the problem of minimizing the observation errors (minimizing the negative log-likelihood), where Σij is the uncertainty associated with each pose measurement. In order to minimize the problem posed above, factor graphs are employed, using the publicly available GTSAM framework to construct and optimize the proposed factor graph model. Minimizing the error functions (24) and (25) in an alternating manner with respect to the object shape and pose parameters yields the estimated shape (Λ) and pose for a given frame i. The pose observations obtained after the shape and pose error minimization form additional factors in the SLAM factor graph; therefore, for each object node in the factor graph, if the pose of an object is denoted by Om, the relative-pose error between the corresponding robot pose and Om is minimized. Here, a data association function uniquely identifies every object Om observed so far. Finally, the object-SLAM error ε, which jointly estimates the robot poses and the object poses using the relative object pose observations, is expressed as the combination of these error terms.

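For illustration, the object-SLAM factor graph described above (camera-camera odometry factors plus camera-object factors) could be assembled with the GTSAM Python bindings roughly as follows. The exact API varies between GTSAM versions, and the noise values are placeholders, so this is a sketch rather than the implementation used in the disclosure.

import numpy as np
import gtsam

def build_object_slam_graph(odometry, object_obs):
    """Build a factor graph with robot-pose and object-landmark nodes.

    odometry   : list of (i, j, gtsam.Pose3) relative camera-camera measurements
    object_obs : list of (i, m, gtsam.Pose3) relative camera-object measurements
    """
    graph = gtsam.NonlinearFactorGraph()
    pose_noise = gtsam.noiseModel.Diagonal.Sigmas(np.full(6, 0.1))   # placeholder sigmas
    obj_noise = gtsam.noiseModel.Diagonal.Sigmas(np.full(6, 0.3))

    x = lambda i: gtsam.symbol('x', i)   # robot pose keys
    o = lambda m: gtsam.symbol('o', m)   # object landmark keys

    graph.add(gtsam.PriorFactorPose3(x(0), gtsam.Pose3(), pose_noise))  # anchor first pose
    for i, j, Tij in odometry:
        graph.add(gtsam.BetweenFactorPose3(x(i), x(j), Tij, pose_noise))
    for i, m, Tio in object_obs:
        graph.add(gtsam.BetweenFactorPose3(x(i), o(m), Tio, obj_noise))
    return graph

# after filling a gtsam.Values with initial estimates:
# result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()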
[044] According to an embodiment of the disclosure, the system 100 also comprises a display screen 122. The display screen 122 is configured to display the estimated trajectory and the retrieved 3D object models.
[045] In operation, Fig. 5A-5B shows a flowchart illustrating a method 200 for integrating one or more objects into monocular simultaneous localization and mapping (SLAM). The method uses line based parameterization for the integration. Initially, at step 202, the image is provided as the input image using a monocular camera. In the next step 204, the viewpoints of the one or more objects present in the input image are initialized using a computational neural network technique. The computational neural network technique is trained using a training data set. In the next step 206, bounding boxes of the one or more objects in the input image are detected using a real time object detection technique such as the YOLO detector. The use of any other technique is well within the scope of this disclosure.
[046] In the next step 208, one or more edges are detected using an LSD edge detector and detected bounding boxes of the one or more objects. Further at step 210, the one or more objects are characterized as the set of 3D lines frame using the detected one or more edges and the initialized viewpoints. In the next step 212, the edge correspondence of the set of 3D lines frame corresponding to a set of 2D lines frame are identified.
[047] In the next step 214, the shape and pose of the set of 2D lines frame of the one or more objects are optimized. The optimization results in the generation of the one or more objects. The optimization is performed using the Ceres optimizer, with the different constraints described above used in the formulation. Finally, at step 216, the generated one or more objects are integrated into monocular simultaneous localization and mapping (SLAM) to get the trajectory of the movement of the monocular camera.
[048] According to an embodiment of the disclosure, the system 100 can also be explained with the help of examples and experimental results.
[049] The dataset was collected as monocular video in an indoor setting comprising office spaces and a laboratory. The data was captured using a micro aerial vehicle (MAV) flying at a constant height above the ground. Sequences 1 and 2 of the dataset are elongated loops with two parallel sides, following dominantly straight-line motion, while Sequence 3 is a 360° rotation in place with no translation from the origin.
[050] Experimental results are presented on multiple real-world sequences comprising different category objects, viz. chair, table and laptop. The present disclosure exploits key edges in the object, corresponding to the respective wire-frame model, to obtain the object trajectory and precisely estimate object poses in various real-world scenarios. The pipeline was also evaluated against the keypoint method by comparing execution times. The time bottleneck for the keypoint method during evaluation is the forward pass of the network. Here, the frame processing time for both methods was compared for an 856 × 480 image containing 3 objects. The hardware used for the keypoint method is a TitanX GPU with 12 GB memory, and for the line based method an Intel i5 processor with 8 GB RAM.
Time per frame in the keypoint method = 3 × inference time per object = 3 × 285 ms = 855 ms.
Time per frame in the method of the present disclosure = time per frame for LSD + 3 × processing time per object = 0.25 + 3 × 120 = 360.25 ms.
[051] So, there is an increase in speed by more than 2 times for the same process.
[052] The embodiments of the present disclosure solve the problem of the time and labor invested in annotating datasets to train a keypoint network for different object categories. The disclosure provides a method and system for integrating objects into monocular SLAM using line based parameterization.
[053] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[054] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[055] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.

Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[056] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[057] It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.

WE CLAIM:
1. A method (200) for integrating one or more objects into monocular
simultaneous localization and mapping (SLAM), the method comprising a
processor implemented steps of:
providing an image as an input image captured using a monocular camera (202);
initializing the viewpoints of the one or more objects present in the input image using a computational neural network technique (204);
detecting bounding boxes of the one or more objects in the input image using a real time object detection technique (206);
detecting one or more edges using an LSD edge detector and detected bounding boxes of the one or more objects (208);
characterizing the one or more objects as a set of 3D lines frame using the detected one or more edges and the initialized viewpoints (210);
identifying the edge correspondence of the set of 3D lines frame corresponding to a set of 2D lines frame (212);
optimizing the shape and pose of the set of 2D lines frame of the one or more objects, wherein the optimization results in the generation of the one or more objects (214); and
integrating the generated one or more objects into monocular simultaneous localization and mapping (SLAM) to get the trajectory of the movement of the monocular camera (216).
2. The method of claim 1 further comprising the step of displaying the trajectory on a display screen.
3. The method of claim 1 further comprising the step of retrieving the generated one or more objects as a 3D model overlaid on the trajectory.

4. The method of claim 1, wherein the real time object detection technique is performed using YOLO detector.
5. The method of claim 1, wherein the optimization is performed using Ceres toolbox.
6. The method of claim 1, wherein the integration is performed using a factor graph formulation.
7. The method of claim 1, wherein the computational neural network technique is trained using a training data set.
8. A system (100) for integrating one or more objects into monocular simultaneous localization and mapping (SLAM), the system comprises:
an input module (102) for providing an image as an input image captured using a monocular camera; a memory (104); and
a processor (106) in communication with the memory, the processor further comprising:
a viewpoint initialization module (108) for initializing the viewpoints of the one or more objects present in the input image using a computational neural network technique;
a bounding box detection module (110) for detecting bounding boxes of the one or more objects in the input image using a real time object detection technique;
an edge detection module (112) for detecting one or more edges using an LSD edge detector and detected bounding boxes of the one or more objects;

a characterization module (114) for characterizing the one or more objects as a set of 3D lines frame using the detected one or more edges and the initialized viewpoints;
a correspondence identification module (116) identifying the edge correspondence of the set of 3D lines frame corresponding to a set of 2D lines frame;
an optimization module (118) for optimizing the shape and pose of the set of 2D lines frame of the one or more objects, wherein the optimization results in the generation of the one or more objects; and an integration module (120) for integrating the generated one or more objects into monocular simultaneous localization and mapping (SLAM) to get the

Documents

Application Documents

# Name Date
1 201821047937-STATEMENT OF UNDERTAKING (FORM 3) [18-12-2018(online)].pdf 2018-12-18
2 201821047937-REQUEST FOR EXAMINATION (FORM-18) [18-12-2018(online)].pdf 2018-12-18
3 201821047937-FORM 18 [18-12-2018(online)].pdf 2018-12-18
4 201821047937-FORM 1 [18-12-2018(online)].pdf 2018-12-18
5 201821047937-FIGURE OF ABSTRACT [18-12-2018(online)].jpg 2018-12-18
6 201821047937-DRAWINGS [18-12-2018(online)].pdf 2018-12-18
7 201821047937-DECLARATION OF INVENTORSHIP (FORM 5) [18-12-2018(online)].pdf 2018-12-18
8 201821047937-COMPLETE SPECIFICATION [18-12-2018(online)].pdf 2018-12-18
9 201821047937-FORM-26 [11-02-2019(online)].pdf 2019-02-11
10 Abstract1.jpg 2019-07-26
11 201821047937-ORIGINAL UR 6(1A) FORM 26-130219.pdf 2019-11-30
12 201821047937-RELEVANT DOCUMENTS [18-06-2021(online)].pdf 2021-06-18
13 201821047937-PETITION UNDER RULE 137 [18-06-2021(online)].pdf 2021-06-18
14 201821047937-OTHERS [18-06-2021(online)].pdf 2021-06-18
15 201821047937-FER_SER_REPLY [18-06-2021(online)].pdf 2021-06-18
16 201821047937-COMPLETE SPECIFICATION [18-06-2021(online)].pdf 2021-06-18
17 201821047937-CLAIMS [18-06-2021(online)].pdf 2021-06-18
18 201821047937-FER.pdf 2021-10-18
19 201821047937-PatentCertificate08-11-2023.pdf 2023-11-08
20 201821047937-IntimationOfGrant08-11-2023.pdf 2023-11-08
21 201821047937-FORM 4 [06-08-2024(online)].pdf 2024-08-06

Search Strategy

1 searchE_17-12-2020.pdf

ERegister / Renewals

3rd: 07 Aug 2024 (from 18/12/2020 to 18/12/2021)
4th: 07 Aug 2024 (from 18/12/2021 to 18/12/2022)
5th: 07 Aug 2024 (from 18/12/2022 to 18/12/2023)
6th: 07 Aug 2024 (from 18/12/2023 to 18/12/2024)
7th: 19 Nov 2024 (from 18/12/2024 to 18/12/2025)
8th: 20 Nov 2025 (from 18/12/2025 to 18/12/2026)