
Systems And Methods For Generating Control Commands For Navigation By An Agent In An Environment

Abstract: A visual servoing approach is utilized for tasks dealing with vision-based control of robots in many real-world applications. However, attaining precise alignment for unseen environments poses a challenge to existing visual servoing approaches. While classical approaches assume a perfect world, recent data-driven approaches face issues when generalizing to both known and unseen environments. The present disclosure implements a deep model predictive visual servoing framework that achieves precise alignment with optimal trajectories and can generalize to existing as well as new/unseen environments. The framework comprised in the system consists of a deep network for optical flow predictions, which are used along with a predictive model to forecast future optical flow. For generating an optimal set of velocities (also referred to as control commands or optimal control commands, and interchangeably used hereinafter), a control network is implemented by the present disclosure that can be trained on-the-fly without any supervision.


Patent Information

Application #
Filing Date
12 November 2020
Publication Number
19/2022
Publication Type
INA
Invention Field
BIO-MEDICAL ENGINEERING
Status
Email
ip@legasis.in
Parent Application
Patent Number
Legal Status
Grant Date
2024-06-20
Renewal Date

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point, Mumbai - 400021, Maharashtra, India

Inventors

1. KUMAR, Gourav
Tata Consultancy Services Limited, Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata - 700160, West Bengal, India
2. BHOWMICK, Brojeshwar
Tata Consultancy Services Limited, Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata - 700160, West Bengal, India
3. KRISHNA, Krishnan Madhava
International Institute of Information Technology, Professor CR Rao Rd, Gachibowli, Hyderabad - 500032, Telangana, India
4. PANDYA, Harit
International Institute of Information Technology, Professor CR Rao Rd, Gachibowli, Hyderabad - 500032, Telangana, India
5. KATARA, Pushkal
International Institute of Information Technology, Professor CR Rao Rd, Gachibowli, Hyderabad - 500032, Telangana, India
6. SANCHAWALA, AadilMehdi JavidHusen
International Institute of Information Technology, Professor CR Rao Rd, Gachibowli, Hyderabad - 500032, Telangana, India
7. YADARTH, Venkata Sai Harish
International Institute of Information Technology, Professor CR Rao Rd, Gachibowli, Hyderabad - 500032, Telangana, India
8. GUPTA, Abhinav
International Institute of Information Technology, Professor CR Rao Rd, Gachibowli, Hyderabad - 500032, Telangana, India

Specification

DESC:FORM 2

THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of invention:
SYSTEMS AND METHODS FOR GENERATING CONTROL COMMANDS FOR NAVIGATION BY AN AGENT IN AN ENVIRONMENT

Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India

The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
The present application claims priority from Indian provisional patent application no. 202021049527, filed on November 12, 2020. The entire contents of the aforementioned application are incorporated herein by reference.

TECHNICAL FIELD
The disclosure herein generally relates to visual based manipulation or navigation of agents in environments, and, more particularly, to systems and methods for generating control commands for manipulation and/or navigation by an agent in an action space-based environment.

BACKGROUND
Visual servo (VS) control or visual servoing refers to the use of images to control the motion of an agent such as a robot, and the like. The concept of visual servoing has been widely used in the field of robotics for a variety of tasks, from manipulation to navigation of agents such as robots. Mapping the observation space to the control/action space is the fundamental objective behind most visual servoing problems. The simplicity of the visual servoing approach makes it an attractive option for tasks dealing with vision-based control of robots in many real-world applications. However, attaining precise alignment for unseen environments poses a challenge to existing visual servoing approaches. While conventional approaches assume a perfect world, recent data-driven approaches face issues when generalizing to both known and unseen environments.

SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one aspect, there is provided a processor implemented method for generating a plurality of control commands for navigation by an agent in an environment. The method comprises: obtaining, via a first flow network executed by one or more hardware processors, (i) a goal image corresponding to an action space-based environment; iteratively performing, until a predicted optical loss reaches a threshold: receiving, via the first flow network executed by the one or more hardware processors, a current image corresponding to the action space-based environment and generating a desired optical flow based on the current image and the goal image, wherein a neural network is trained using the desired optical flow to obtain a trained neural network; receiving, via a second flow network executed by the one or more hardware processors, (i) a previous image and (ii) the current image corresponding to the action space-based environment and generating a current scene depth information thereof; generating, via a kinetic model executed by the one or more hardware processors, a predicted optical flow, using the current scene depth information; performing a comparison of the predicted optical flow and the desired optical flow to obtain the predicted optical loss; and generating, via the trained neural network (RLS) executed by the one or more hardware processors, one or more optimal control commands based on (i) the predicted optical loss and (ii) one or more previously executed control commands being initialized.
In an embodiment, the action space-based environment is at least one of a discrete action space environment and a continuous action space environment.
In an embodiment, the step of generating the current scene depth information thereof is preceded by generating a proxy flow using (i) the previous image and (ii) the current image corresponding to the action space-based environment; and converting the proxy flow to the current scene depth information.
In an embodiment, the predicted optical loss is indicative of a difference between the predicted optical flow and the desired optical flow.
In an embodiment, when the predicted optical loss reaches the threshold, the current image substantially matches the goal image.
In another aspect, there is provided a system for generating a plurality of control commands for navigation by an agent in an environment. The system comprises: a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: obtain, via a first flow network executed by the one or more hardware processors, (i) a goal image corresponding to an action space-based environment; iteratively perform, until a predicted optical loss reaches a threshold: receiving, via the first flow network, a current image corresponding to the action space-based environment and generating a desired optical flow based on the current image and the goal image, wherein a neural network is trained using the desired optical flow to obtain a trained neural network; receiving, via a second flow network, (i) a previous image and (ii) the current image corresponding to the action space-based environment and generating a current scene depth information thereof; generating, via a kinetic model, a predicted optical flow, using the current scene depth information; performing a comparison of the predicted optical flow and the desired optical flow to obtain the predicted optical loss; and generating, via the trained neural network, one or more optimal control commands based on (i) the predicted optical loss and (ii) one or more previously executed control commands being initialized.
In an embodiment, the action space-based environment is at least one of a discrete action space environment and a continuous action space environment.
In an embodiment, wherein the current scene depth information is generated by generating a proxy flow using (i) the previous image and (ii) the current image corresponding to the action space-based environment; and converting the proxy flow to the current scene depth information.
In an embodiment, the predicted optical loss is indicative of a difference between the predicted optical flow and the desired optical flow.
In an embodiment, when the predicted optical loss reaches the threshold, the current image substantially matches the goal image.
In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause generation of a plurality of control commands for navigation by an agent in an environment by: obtaining, via a first flow network executed by one or more hardware processors, (i) a goal image corresponding to an action space-based environment; iteratively performing, until a predicted optical loss reaches a threshold: receiving, via the first flow network executed by the one or more hardware processors, a current image corresponding to the action space-based environment and generating a desired optical flow based on the current image and the goal image, wherein a neural network is trained using the desired optical flow to obtain a trained neural network; receiving, via a second flow network executed by the one or more hardware processors, (i) a previous image and (ii) the current image corresponding to the action space-based environment and generating a current scene depth information thereof; generating, via a kinetic model executed by the one or more hardware processors, a predicted optical flow, using the current scene depth information; performing a comparison of the predicted optical flow and the desired optical flow to obtain the predicted optical loss; and generating, via the trained neural network (RLS) executed by the one or more hardware processors, one or more optimal control commands based on (i) the predicted optical loss and (ii) one or more previously executed control commands being initialized.
In an embodiment, the action space-based environment is at least one of a discrete action space environment and a continuous action space environment.
In an embodiment, the step of generating the current scene depth information thereof is preceded by generating a proxy flow using (i) the previous image and (ii) the current image corresponding to the action space-based environment; and converting the proxy flow to the current scene depth information.
In an embodiment, the predicted optical loss is indicative of a difference between the predicted optical flow and the desired optical flow.
In an embodiment, when the predicted optical loss reaches the threshold, the current image substantially matches the goal image.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 depicts a system for generating control commands for navigation by an agent in an environment, in accordance with an embodiment of the present disclosure.
FIG. 2 depicts a block diagram of an architecture as implemented by the system of FIG. 1 for generating control commands for navigation by an agent in an environment, in accordance with an embodiment of the present disclosure.
FIG. 3 depicts an exemplary flow chart illustrating a method for generating control commands for navigation by an agent in an environment, using the system 100 of FIG. 1, in accordance with an embodiment of the present disclosure.
FIG. 4A depicts a graphical representation illustrating a performance comparison of the method of the present disclosure and conventional approaches with reference to Photometric Error and Number of Iterations, in accordance with an embodiment of the present disclosure.
FIG. 4B depicts a graphical representation illustrating performance comparison of the method of the present disclosure and conventional approaches with reference to camera trajectories, in accordance with an embodiment of the present disclosure.
FIG. 4C depicts a graphical representation illustrating a performance comparison of the method of the present disclosure and conventional approaches with reference to Photometric Error and Number of Iterations with different lambda values being plotted, in accordance with an embodiment of the present disclosure.
FIG. 5 depicts a graphical representation illustrating performance on Fully Unsupervised Controller as implemented by the system of FIG. 1, in accordance with an embodiment of the present disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
Visual Servoing or Visual servo (VS) control refers to the use of images to control the motion of an agent (e.g., robot, drone, unmanned aerial vehicle, and the like). The concept of visual servoing has been widely used in the field of robotics for a variety of tasks from manipulation to navigation. Mapping the observation space to the control/action space is the fundamental objective behind most visual servoing problems. Conventionally, visual servoing has been highly dependent on handcrafted image features, environment modeling, and an accurate understanding of system dynamics for the generation of control commands. Sensitivity to the accuracy of the available information also prevents such systems from adjusting to the statistical irregularities of the world. To mitigate these issues, recent approaches aim to learn supervised models that predict relative camera poses between the current and the desired images. Although this overcomes the requirement of handcrafted image features or any explicit information about the environment, it fails to generalize in unseen environments. Other approaches further attempt to exploit information from intermediate representations like optical flow; however, these use a classical visual servoing controller which only looks one step ahead and hence results in sub-optimal trajectories. On the other hand, many deep reinforcement learning based approaches known in the art do not show convergence in a six degrees of freedom (6-DoF) domain due to an intractable number of samples in higher dimensional action and continuous state spaces. This narrows down their scope of performance in real-world visual servoing tasks for precise alignment. Furthermore, they also need to be extensively retrained in newer environments; as a result, their generalization capabilities are limited. A few approaches employed a model-based learning framework for visual servoing, and these have attempted to learn a forward predictive model to represent the process and a policy for planning over the predictive model. However, their predictive model is trained to minimize an appearance loss which is difficult to optimize over a longer horizon and does not generalize to novel environments. Furthermore, the policy is obtained either through sampling or by learning from human demonstrations, which do not scale to a higher dimensional action space.
Visual Servoing or Visual servo (VS) control refers to the use of images to control the motion of an agent (e.g., robot, drone, unmanned aerial vehicle, and the like). The objective is to generate control commands that minimize the difference between current and goal images (or their representations). To achieve this task, classical visual servoing approaches extract a set of visual features from images and define a cost function as the least squared error in these visual features. The control law is to perform gradient descent in feature space, which is then mapped to the robot’s velocities using an Image Jacobian. Conventional visual servoing approaches employ local appearance-based features such as key points, lines, and contours to describe the scene. On the other hand, other conventional direct visual servoing approaches skip the feature extraction and directly minimize the difference in images. Although these direct approaches mitigate the issue of incorrect matches, they do not converge for larger camera transformations. One of the limitations of classical visual servoing methods is the requirement of the depth of the visual features while computing the Image Jacobian, which is generally not available in the case of a monocular camera.
Deep Visual Servoing: Recent deep learning driven approaches learn deep neural networks that directly aim to predict the relative camera pose between the current and the desired image. The controller then takes a small step to minimize this error. These neural networks are learned in a supervised fashion and consequently do not generalize well to unseen environments. A very recent conventional approach (e.g., refer ‘Y. V. S. Harish, H. Pandya, A. Gaud, S. Terupally, S. Shankar, and K. M. Krishna. Dfvs: Deep flow guided scene agnostic image based visual servoing. In IEEE ICRA, pages 3817–3823. IEEE, 2020.’) combines deep learning with visual servoing by learning intermediate flow representations and minimizing their difference using a classical visual servoing controller. This conventional approach effectively generalizes to unseen environments in 6-DoF. However, since the controller optimizes only for a single time step, it generates sub-optimal trajectories and could get stuck in local minima as a result of this greedy behavior.
Visual Navigation using Reinforcement Learning: Visual servoing could also be posed as a target driven visual navigation problem. Motivated by the success attained by deep reinforcement learning (DRL) approaches in maze navigation tasks, there has been an increase in interest in applying imitation learning or reinforcement learning methods for visual navigation in indoor environments. In contrast to classical visual servoing approaches that greedily plan one time step, DRL approaches plan an optimal policy over a longer horizon even when there is no overlap between the current and desired image. However, such model-free end-to-end learning-based approaches are sample inefficient and thus do not scale to higher dimensional continuous actions, and they face difficulty while generalizing to new environments.
Model Based Visual Control: Model predictive control (MPC) approaches have been successful in learning complex skills in robotics, such as controlling quadrotors and humanoids, using accurate system dynamics. Conventionally, key point-based MPC models have been proposed for use in Image-Based Visual Servoing (IBVS) to generate optimal policies under the assumption of accurate matching and using a handful of key points. Classical MPC approaches have two limitations: firstly, they require accurate dynamics, and secondly, they do not scale well in large state spaces.
Recent deep learning approaches have aimed to improve upon this limitation by simultaneously learning the model dynamics along with the policies. For learning visuo-motor control, recent approaches propose a deep network to learn forward dynamics by predicting future observations. To learn the policies along with the dynamics, there are a few choices: (i) projecting the latent space into a Cartesian world and using classical approaches to generate control; however, directly learning camera pose could be inaccurate for unseen environments and 6-DoF actions; (ii) using a neural network to approximate the policy, learned in a supervised fashion from human demonstrations; however, the issue with this method is that obtaining a large and diverse dataset of human demonstrations is expensive; (iii) learning the policy in an unsupervised manner by sampling from the action space, which is inefficient when the action and the state spaces are large.
Embodiments of the present disclosure provide systems and methods that bridge the gap between classical and learning-based control for 6-DoF visual servoing in known as well as new/unseen environments with action and state space. Optical flow is selected to represent states instead of directly working with images. The dense optical flow is predicted using deep neural networks. Subsequently, a predictive model is reformulated for forecasting the evolution of states given a sequence of actions (velocities). A recurrent control network is then learnt on-the-fly in an unsupervised manner for computing an optimal set of velocities for a given goal state. The present disclosure demonstrates superior performance vis-a-vis deep visual servoing methods due to a receding horizon controller, even as the framework generalizes to new environments without needing to be retrained or fine-tuned. The controller regresses to a discrete/continuous action space of outputs over 6-DoF.
In other words, in the present disclosure, a deep model predictive control strategy/framework is implemented that exploits the visual servoing concept in a principled manner. This is achieved by formulating a unique state prediction model for visual servoing based on dense off-the-shelf/unsupervised optical flow predictions. The prediction model is employed to generate optimal velocity commands using a control network. The present disclosure enables an efficient unsupervised strategy for online optimization of the control network, wherein the method of the present disclosure has been validated on a photo-realistic visual servoing dataset (a publicly available dataset), comparing against an exhaustive set of baselines among classical, deep supervised, and model predictive visual servoing approaches. Furthermore, the present disclosure showcases the robustness of the implemented framework in the presence of actuation errors. This demonstrates the effectiveness of the method of the present disclosure for high dimensional control in new/unseen situations. More specifically, the present disclosure describes the following:
An implementation of an unsupervised Deep Model Predictive Control pipeline for visual servoing using optical flow as intermediate representation in 6-DoF.
The control actions are learned unsupervised as the LSTM decoder strives to predict future controls over a time horizon such that the optical flow accrued as a consequence of these predictions matches the desired flow. The desired flow between the current and desired images is computed either through a supervised or unsupervised framework as known in the art.
The online optimization over the Model predictive control (MPC) framework enables the method of the present disclosure to adapt to any environment even with inaccurate flow predictions and system dynamics information.
The method of the present disclosure is compared with existing deep visual servoing methods by evaluating parameters such as translation error, rotation error, trajectory length, total number of iterations, time per iteration, and the like. Through experimental results, it has been observed that the method of the present disclosure exhibits significant convergence and is faster than state-of-the-art deep visual servoing methods.
Referring now to the drawings, and more particularly to FIGS. 1 through 5, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 depicts a system 100 for generating control commands for navigation by an agent in an environment, in accordance with an embodiment of the present disclosure. In an embodiment, the system 100 includes one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106 (also referred as interface(s)), and one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104. The one or more processors 104 may be one or more software processing components and/or hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is/are configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 108 is comprised in the memory 102, wherein the database 108 comprises current image, previous image, and goal image corresponding to an action space-based environment. The database 108 further stores information on action space-based environments.
The database 108 further stores information on desired optical flow, proxy flow, current scene depth information, predicted optical loss, control commands. Information stored in the database 108 further comprises one or more previously executed control commands being initialized.
The database further comprises one or more networks such as one or more flow networks, one or more neural network(s) which when invoked and executed perform corresponding steps/actions as per the requirement by the system 100 to perform the methodologies described herein. The memory 102 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure. In other words, input(s) fed at each step and output(s) generated at each step are comprised in the memory 102 and can be utilized in further processing and analysis.
FIG. 2, with reference to FIG. 1, depicts a block diagram of an architecture as implemented by the system 100 of FIG. 1 for generating control commands for navigation by an agent in an environment, in accordance with an embodiment of the present disclosure.
FIG. 3, with reference to FIGS. 1-2, depicts an exemplary flow chart illustrating a method for generating control commands for navigation by an agent in an environment, using the system 100 of FIG. 1, in accordance with an embodiment of the present disclosure.
Referring to FIG. 3, in an embodiment, the system(s) 100 comprises one or more data storage devices or the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104. The steps of the method of the present disclosure will now be explained with reference to components of the system 100 of FIG. 1, the block diagram of FIG. 2, and the flow diagram as depicted in FIG. 3. In an embodiment, at step 202 of the present disclosure, the one or more hardware processors 104 obtain, via a first flow network executed by the one or more hardware processors, (i) a goal image corresponding to an action space-based environment. As depicted in the block diagram of FIG. 2, the first flow network (e.g., refer flow network that is in the first row of the block diagram) receives the goal image (I^*). In an embodiment, the action space-based environment is at least one of a discrete action space environment, and a continuous action space environment.
In an embodiment, at step 204 of the present disclosure, the one or more hardware processors 104 (i) receive, via the first flow network executed by the one or more hardware processors, a current image corresponding to the action space-based environment and (ii) generate a desired optical flow based on the current image and the goal image. As depicted in the block diagram of FIG. 2, the first flow network (e.g., refer flow network that is in the first row of the block diagram) receives the current image (I_t) at each iteration.
In an embodiment, at step 206 of the present disclosure, the one or more hardware processors 104 receive, via a second flow network executed by the one or more hardware processors, (i) a previous image and (ii) the current image corresponding to the action space-based environment and generate current scene depth information thereof. As depicted in the block diagram of FIG. 2, the second flow network (e.g., refer to the flow network in the third row of the block diagram) receives the current image (I_t) and the previous image (I_(t-1)) as input at each iteration and generates the current scene depth information. Prior to generating the current scene depth information, the one or more hardware processors 104 generate (or may generate) a proxy flow using (i) the previous image and (ii) the current image corresponding to the action space-based environment. The hardware processors 104 then convert the proxy flow to the current scene depth information.
The steps 202 till 206 are better understood by way of the following description provided as exemplary explanation.
In the present disclosure, the problem of visual servoing is considered as target driven navigation in unseen environments using an image capturing device (e.g., a monocular camera). Let I_t denote the observation of an agent (e.g., a robot, an unmanned aerial vehicle (UAV), and the like), in the form of a monocular Red, Green, Blue (RGB) image at any time instant t, and let I^* denote the desired observation. The agent may comprise one or more actuators, one or more sensors, and the like for performing one or more tasks such as manipulation, navigation, and the like. The one or more actuators, one or more sensors, and the like may be either an integral part of the agent or externally connected to the agent via one or more I/O interfaces as known in the art. The goal is to generate optimal control commands [v_t, ω_t] in 6-DoF that minimize the photometric error (also referred to as predicted optical loss or flow loss, and interchangeably used herein) between I^* and I_t. It is assumed by the system and method of the present disclosure that the image capturing device is attached to the agent (also known as an eye-in-hand configuration) and is calibrated. It is further assumed by the system and method of the present disclosure that the environment is collision free and there exists a partial overlap between the initial and desired image. In contrast to existing visual navigation approaches, the system and method of the present disclosure aim to generate trajectories in continuous action (in 6-DoF) and state space (as depicted in FIG. 2).
Instead of directly planning in the image observation space, the present disclosure employs optical flow as an intermediate visual representation for encoding differences in images (e.g., the difference between the goal image and the current image serving as the predicted optical loss/flow loss as depicted in FIG. 2). Optical flow encodes displacement for every pixel, which is more relevant information for feature tracking and matching as compared to pixel intensities. As a result, optical flow has been successfully used in motion estimation and visual servoing. Furthermore, in the presence of translational camera motion in a static environment, dense flow features could also be used for estimating image depth. Although the present disclosure does not constrain itself to any specific network for flow or depth estimation, a pre-trained neural network, Flownet 2 (e.g., refer ‘E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks, 2016.’) is implemented by the system 100 for flow estimation without any fine-tuning. To showcase that the method of the present disclosure can achieve similar performance even when trained without any supervision, Flownet 2 has been replaced with another network architecture, DDFlow (data distillation flow – refer ‘P. Liu, I. King, M. R. Lyu, and J. Xu. Ddflow: Learning optical flow with unlabeled data distillation, 2019’), which has been trained from scratch using only images in an unsupervised manner.
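By way of illustration only, the following is a minimal Python sketch of how the two flow-related quantities described above might be obtained: the desired flow F(I_t, I^*) from an off-the-shelf dense flow estimator, and a crude depth proxy from the inter-frame (proxy) flow. The FlowNetwork wrapper and the inverse-magnitude depth heuristic are assumptions for illustration; the disclosure itself uses Flownet 2 or DDFlow for flow and only states that the proxy flow is converted to scene depth, without prescribing this particular conversion.

import numpy as np

class FlowNetwork:
    """Hypothetical wrapper around any dense optical-flow estimator
    (Flownet 2 or DDFlow in the disclosure). The backbone is any callable
    that maps an image pair of shape (H, W, 3) to a flow field of shape
    (H, W, 2) holding per-pixel displacements (du, dv)."""

    def __init__(self, backbone):
        self.backbone = backbone

    def estimate_flow(self, img_a, img_b):
        return self.backbone(img_a, img_b)


def desired_flow(flow_net, current_image, goal_image):
    # Desired flow F(I_t, I^*): displacement that carries I_t towards I^*.
    return flow_net.estimate_flow(current_image, goal_image)


def depth_from_proxy_flow(proxy_flow, scale=1.0, eps=1e-6):
    # Illustrative heuristic only: treat depth as inversely proportional to
    # the magnitude of the inter-frame (proxy) flow F(I_{t-1}, I_t). The
    # disclosure does not specify this exact conversion.
    magnitude = np.linalg.norm(proxy_flow, axis=-1)
    return scale / (magnitude + eps)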
Referring to the steps of FIG. 3 of the present disclosure, at step 208, the one or more hardware processors 104 generate, via a kinetic model (e.g., also referred to as the interaction matrix as depicted in FIG. 2) executed by the one or more hardware processors, a predicted optical flow using the current scene depth information. Step 208 can be better understood by way of the following description provided as exemplary explanation.
The fundamental objective of IBVS is to minimize the error in visual features between the current and desired (goal) images, e_t = ‖s(I_t) − s(I^*)‖, which is solved by performing gradient descent in feature space, resulting in the following control law:
V_t = -\lambda L^{+} (S_t - S^*)    (1)
where V_t = [v_t, ω_t] is the velocity command (control command) generated by the controller in 6-DoF, and S_t and S^* are the visual features extracted from the images I_t and I^* respectively. The gradient descent step size is denoted by λ and (·)^+ denotes the pseudo-inverse operation. The image Jacobian, or interaction matrix, mapping the camera velocity to the rate of change of features is given by,
L(Z_t) = \begin{bmatrix} -\frac{1}{Z_t} & 0 & \frac{x}{Z_t} & xy & -(1+x^2) & y \\ 0 & -\frac{1}{Z_t} & \frac{y}{Z_t} & 1+y^2 & -xy & -x \end{bmatrix}    (2)
where x and y are the normalized image coordinates and Z_t is the scene's depth at (x, y, t). For a small time interval d_t, the IBVS control law, i.e., equation (1), can be reformulated as S_{t+d_t} = S_t + L(Z_t) V_t, which can be rewritten using optical flow as visual features as,
F(I_t, I_{t+d_t}) = L(Z_t) V_t    (3)
This equation gives the predictive model that could be employed by the system to reconstruct the future states (flows, in the case of the present disclosure) from the current velocity. It also points out that generating an optimal set of actions, in terms of velocities V^*_{t+1:t+T}, is a critical step for the convergence of the system towards the desired image. The present disclosure achieves this task by online trajectory optimization using the predictive model of equation (3) and a control network, as described below.
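As an illustration of equations (2)-(3), the following minimal Python/NumPy sketch builds the per-pixel interaction matrix and predicts the flow induced by a 6-DoF velocity for a given depth map. It is not the disclosure's implementation; the image size, camera intrinsics, depth map, and example velocity below are assumptions for illustration only.

import numpy as np

def interaction_matrix(x, y, Z):
    """Equation (2): interaction matrix for one normalized image point
    (x, y) at depth Z; maps a 6-DoF camera velocity [vx, vy, vz, wx, wy, wz]
    to the point's image velocity. Returns a (2, 6) array."""
    return np.array([
        [-1.0 / Z, 0.0,      x / Z, x * y,        -(1.0 + x ** 2),  y],
        [0.0,      -1.0 / Z, y / Z, 1.0 + y ** 2, -x * y,          -x],
    ])

def predict_flow_one_step(depth, velocity, fx=320.0, fy=320.0, cx=160.0, cy=120.0):
    """Equation (3): flow induced by velocity V_t over a small time step,
    given a per-pixel depth map. Intrinsics are illustrative assumptions."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) / fx  # normalized image coordinates
    y = (v - cy) / fy
    flow = np.zeros((H, W, 2))
    for i in range(H):          # explicit loops for clarity; vectorize in practice
        for j in range(W):
            L = interaction_matrix(x[i, j], y[i, j], depth[i, j])
            flow[i, j] = L @ velocity
    return flow

# Toy usage: flat scene 2 m away, small translation along x plus a slight yaw.
depth_map = np.full((8, 8), 2.0)
V_t = np.array([0.05, 0.0, 0.0, 0.0, 0.0, 0.01])
predicted_flow = predict_flow_one_step(depth_map, V_t)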
Further, a neural network (e.g., a Recurrent Latent State (RLS) neural network) is trained using the desired optical flow to obtain a trained neural network. This training is performed at each iteration until the current image substantially matches the goal image. The training of the neural network is also referred to as online trajectory optimization using a Neural Model Predictive Controller (MPC). The online trajectory optimization using Neural MPC can be better understood by way of the following description provided as exemplary explanation.
Model predictive controllers are proven to work well for a wide variety of tasks in which accurate analytical models are difficult to obtain. A model predictive controller aims to generate a set of optimal control commands V^*_{t+1:t+T} which minimizes a given cost function C(\hat{X}_{t+1:t+T}, V_{t+1:t+T}) over predicted states \hat{X} and control inputs V for some finite time horizon T. In order to formulate visual servoing as MPC with intermediate flow representations, the cost function is defined as the mean squared error in flow between any two given images. The MPC objective, expressed below,
V^*_{t+1:t+T} = \arg\min_{V_{t+1:t+T}} \| \hat{F}(I_t, I^*) - \hat{F}(V_{t+1:t+T}) \|    (4)
is then to generate a set of velocities V^*_{t+1:t+T} that minimizes the error between the desired flow \hat{F}(I_t, I^*) predicted by the flow network and the generated flow \hat{F}(V_{t+1:t+T}). Exploiting the additive nature of optical flow (i.e., F(I_1, I_3) = F(I_1, I_2) + F(I_2, I_3)), the generated flow can be written as:
\hat{F}(V_{t+1:t+T}) = \sum_{k=1}^{T} L(Z_t) V_{t+k}    (5)
using the predictive model from equation (3). This formulation provides an optimal trajectory over a horizon T, as compared to the classical/conventional IBVS controller of equation (1) that greedily optimizes a single step. Another advantage of the method of the present disclosure is a significant decrease in computation time as compared to the conventional IBVS controller, since the overhead of the matrix inversion required for computing the pseudo-inverse L^{+} = (L^T L + \mu\,\mathrm{diag}(L^T L))^{-1} L^T is avoided.
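The following short NumPy sketch mirrors equations (4)-(5): with the depth (and hence L(Z_t)) held fixed over a short horizon, the flow accrued by a candidate velocity sequence is the sum of the per-step flows, and the MPC cost is the flow error against the desired flow. The shapes and toy data are illustrative assumptions, not values from the disclosure.

import numpy as np

def horizon_flow(L_stack, velocities):
    """Equation (5): accumulated flow for a velocity sequence V_{t+1:t+T}.
    L_stack: (N, 2, 6) per-pixel interaction matrices (N image points).
    velocities: (T, 6) sequence of 6-DoF velocities.
    Returns an (N, 2) flow, i.e., sum_k L(Z_t) V_{t+k}."""
    return np.einsum('nij,tj->ni', L_stack, velocities)

def mpc_flow_loss(L_stack, velocities, desired_flow):
    """Equations (4)/(6): mean squared error between the flow generated by
    the candidate velocities and the desired flow F(I_t, I^*)."""
    generated = horizon_flow(L_stack, velocities)
    return float(np.mean((generated - desired_flow) ** 2))

# Toy usage: 100 image points, horizon T = 5, random stand-in data.
rng = np.random.default_rng(0)
L_stack = rng.normal(size=(100, 2, 6))
desired = rng.normal(size=(100, 2))
V_seq = np.zeros((5, 6))
print(mpc_flow_loss(L_stack, V_seq, desired))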
The above formulation allows the present disclosure and its system and method to control in an action space-based environment (e.g., a continuous 6-DoF action space environment). However, since the dimensionality of the states (e.g., flow representations) is large, conventional MPC solvers cannot be used to optimize equation (4). To tackle this issue, a sampling based Cross Entropy Method (CEM) was used by conventional approaches. However, sampling is inefficient, especially in higher dimensional continuous action spaces. Another conventional approach employed a neural network for policy generation. This conventional approach trained its network using human demonstrations. However, obtaining human demonstrations is expensive, especially for 6-DoF. The method of the present disclosure is compared with these conventional approaches in the experimental results below.
In the present disclosure, a recurrent neural network such as a Recurrent Latent Space (RLS) network has been used by the system 100 and executed to generate one or more optimal control commands (e.g., also referred to as velocity commands). This choice is more efficient than sampling-based approaches such as CEM since a Recurrent Neural Network (RNN) can be directly trained using Back Propagation Through Time (BPTT). Furthermore, for sequence prediction tasks an RNN is a natural choice over a feed-forward neural network. To train the control network of the present disclosure, the system 100 employed predictions from the flow network as targets. Thus, the target remains fixed for online trajectory optimization. Note that the linearized predictive model of the present disclosure depends only on the depth of the scene, and, assuming depth consistency for a smaller horizon, the MPC objective is written only in terms of control. Hence, the control network as depicted in FIG. 2 of the present disclosure could easily be trained in a supervised manner by minimizing the flow loss:
L_{flow} = \| \hat{F}(\hat{V}_{t+1:t+T}) - F(I_t, I^*) \| = \| \sum_{k=1}^{T} L(Z_t) \hat{V}_{t+k} - F(I_t, I^*) \|    (6)
where the RNN g(V_t, θ) is used to generate the control command at the next time instant, \hat{V}_{t+1} = g(V_t, θ), given the current control command (or the current velocity). Hence, given the previous velocity V_{t-1}, the network tries to improve the predicted velocity and to compensate for approximations in the predictive model by adjusting its parameters θ such that the predicted flow loss is minimized.
In an embodiment, at step 210 of the present disclosure, the one or more hardware processors 104 perform a comparison of the predicted optical flow and the desired optical flow to obtain the predicted optical loss. The predicted optical loss indicates a difference between the predicted optical flow and the desired optical flow, in one embodiment of the present disclosure.
In an embodiment, at step 212 of the present disclosure, the one or more hardware processors 104 generate, via the trained neural network (RLS) executed by the one or more hardware processors, one or more optimal control commands based on (i) the predicted optical loss and (ii) one or more previously executed control commands being initialized. The one or more optimal control commands are then executed (e.g., either by the agent or by any external or internal computing device connected to the agent) for manipulation of the agent or for navigation of the agent in the action space-based environment. In an embodiment, the system 100 performing the above steps of the method depicted in FIGS. 2 and 3 may be either an integral part of the agent (e.g., drone, robot, UAV, and the like) or externally attached/communicatively connected to the agent via one or more I/O interfaces (e.g., input/output interfaces as known in the art). The optimal control commands comprise, but are not limited to, the direction in which the agent should move, the action(s) that the agent needs to perform, and the like. The steps 210 and 212 can be better understood by way of the following description provided as exemplary explanation.
In the present disclosure, a Long Short-Term Memory (LSTM) architecture with 5 LSTM cell units was used by the system 100 as the RNN for generating velocities. The sequence length (the length of the planning horizon T) is a tuning parameter that trades off precision against computation time. In the present disclosure, a sequence length T = 5 has been selected. The network receives the previous 6-DoF velocity V_{t-1} (e.g., also referred to as one or more previously executed control commands being initialized) as input and predicts the generated flows \hat{F}_{t:t+T}. The network is trained online in a supervised manner for M = 100 iterations by minimizing the predicted flow loss, equation (6), where the predicted optical flow is generated by the flow network of the system. After the neural network is trained for M iterations, V_{t+1} is executed and the next image I_{t+1} is observed, which is used to predict the next optical/target flow F(I_{t+1}, I^*). Subsequently, the interaction matrix L(Z_{t+1}) is recomputed based on the updated depth Z_{t+1}, using F(I_t, I_{t+1}) as the proxy flow.
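A minimal PyTorch sketch of the on-the-fly optimization described above is given below: an LSTM decoder, seeded with the previously executed velocity, unrolls T = 5 velocity predictions; the predictive model turns them into a generated flow; and the parameters are updated for M = 100 iterations to minimize the flow loss of equation (6), with only the desired flow as the target (no ground-truth velocities). The hidden size, optimizer, and learning rate are assumptions for illustration and are not taken from the disclosure.

import torch
import torch.nn as nn

class VelocityLSTM(nn.Module):
    """Recurrent control network sketch: unrolls a horizon of 6-DoF
    velocity predictions starting from the previously executed velocity."""

    def __init__(self, hidden_size=64, horizon=5):
        super().__init__()
        self.horizon = horizon
        self.cell = nn.LSTMCell(6, hidden_size)
        self.head = nn.Linear(hidden_size, 6)

    def forward(self, v_prev):
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        v = v_prev.view(1, 6)
        outputs = []
        for _ in range(self.horizon):
            h, c = self.cell(v, (h, c))
            v = self.head(h)              # next 6-DoF velocity prediction
            outputs.append(v)
        return torch.cat(outputs, dim=0)  # shape (T, 6)

def train_control_network_online(L_stack, desired_flow, v_prev, iters=100, lr=1e-3):
    """Online, unsupervised optimization of the controller: the only target
    is the desired flow, so no velocity labels are needed."""
    net = VelocityLSTM()
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(iters):                       # M = 100 in the disclosure
        optimizer.zero_grad()
        v_seq = net(v_prev)                      # (T, 6) predicted velocities
        # Predictive model (equation (5)): accrued flow = sum_k L(Z_t) V_{t+k}
        generated = torch.einsum('nij,tj->ni', L_stack, v_seq)
        loss = torch.mean((generated - desired_flow) ** 2)   # equation (6)
        loss.backward()                          # BPTT through the unrolled LSTM
        optimizer.step()
    with torch.no_grad():
        return net(v_prev)[0]                    # execute only V_{t+1}

# Toy usage with random stand-in data: 100 image points, horizon 5.
L_stack = torch.randn(100, 2, 6)
desired = torch.randn(100, 2)
v_next = train_control_network_online(L_stack, desired, torch.zeros(6))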
In an embodiment of the present disclosure, the steps 204 till 212 are iteratively performed until the predicted optical loss reaches a threshold (e.g., a dynamically determined threshold, an empirically determined threshold, or a pre-defined threshold). In other words, the steps 204 till 212 are iteratively performed by the system 100 until the current image substantially matches the goal image. Once the predicted optical loss reaches the threshold, the current image tends to substantially match the goal image. For instance, assume that the threshold is an x% loss or at least a y% match, wherein the values of ‘x’ and ‘y’ can vary depending upon the implementation of the system for a given action space-based environment type. The threshold, expressed in terms of ‘x%’ and/or ‘y%’, may also be referred to as a convergence threshold, and the terms may be interchangeably used herein, in one embodiment of the present disclosure. For the sake of brevity, assume that the threshold is a 5% loss or a 95% matching threshold. In such scenarios, the steps 204 till 212 are iteratively performed until the predicted optical loss between the predicted optical flow and the desired optical flow is reduced to 5%. In other words, the difference in the scene information of the current image when compared with the goal image is 5%. Once this is achieved, it is presumed that the agent has more or less reached the goal image or the given goal (or objective). In another embodiment, the difference in the scene information of the current image when compared with the goal image is less than or equal to x% (e.g., ≤ 5%). This also means that there is at least a 95% match of the scene information of the current image when compared to the scene information of the goal image. In yet another embodiment, the scene information of the current image when compared with the goal image results in a match greater than or equal to y% (e.g., ≥ 95%). More specifically, the scene information comprised in the current image substantially matches the scene information comprised in the goal image that is specific to the corresponding action space-based environment. It is to be understood by a person having ordinary skill in the art that the values of ‘x’ and ‘y’ can take any value depending upon the implementation and environment type, and such values of ‘x’ and ‘y’ shall not be construed as limiting the scope of the present disclosure. In an embodiment, once the environment is detected/identified, the threshold may be dynamically determined and configured in real-time. In another embodiment, the threshold may be pre-defined and configured by the system 100.
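The convergence check described above can be sketched as follows; the relative normalization against the desired-flow magnitude and the 5% default are illustrative assumptions matching the example in the text, not a prescribed formula.

import numpy as np

def has_converged(predicted_flow, desired_flow, threshold=0.05):
    """Returns True once the predicted optical loss, taken here as the flow
    residual relative to the desired-flow magnitude, falls at or below the
    threshold (0.05 corresponds to the 5% example in the text)."""
    residual = np.linalg.norm(predicted_flow - desired_flow)
    return residual <= threshold * (np.linalg.norm(desired_flow) + 1e-9)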
The entire approach/method of the present disclosure can be further understood by way of the following pseudo code, provided as an example:
Require: I^*, ε    ▹ Goal image, convergence threshold
Initialize V_0 with a random velocity
while ‖I − I^*‖ ≥ ε do    ▹ Convergence criterion
    I_t ← get-current-obs()    ▹ Obtain the current RGB observation from the camera
    F(I_t, I^*) ← predict-optical-flow(I_t, I^*)    ▹ Predict the optical flow (target flow) using the flow network
    L_t := compute-interaction-matrix(F(I_{t-1}, I_t))
    for m = 0 to M do    ▹ Online training of the control network
        \hat{V}_{t:t+T} ← g(V_{t-1}, θ_m)    ▹ Velocity predictions from the control network
        \hat{F}(\hat{V}_{t+1:t+T}) = Σ_{k=1}^{T} [L(Z_t) \hat{V}_{t+k}]    ▹ Generate flow using the predictive model
        L_flow = ‖\hat{F}(\hat{V}_{t+1:t+T}) − F(I_t, I^*)‖    ▹ Compute the predicted optical/flow loss
        θ_{m+1} ← θ_m − α ∇ L_flow    ▹ Update the control network parameters
    end for
    V_{t+1} = \hat{V}_{t+1}    ▹ Execute the next control command
end while
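For completeness, the pseudo code above can be rendered as a compact Python loop. The environment interface (get_current_obs, execute), the placeholder functions, and their shapes are hypothetical stand-ins for the components already described (flow network, interaction-matrix predictive model, and online-trained control network); this is a sketch of the control flow only, not the disclosure's implementation.

import numpy as np

def predict_optical_flow(img_a, img_b):
    # Placeholder for the flow network (Flownet 2/DDFlow in the disclosure);
    # returns an (H, W, 2) flow field.
    return np.zeros(img_a.shape[:2] + (2,))

def compute_interaction_matrices(proxy_flow):
    # Placeholder for converting the proxy flow F(I_{t-1}, I_t) to depth and
    # stacking per-pixel interaction matrices L(Z_t), shape (N, 2, 6).
    n = proxy_flow.shape[0] * proxy_flow.shape[1]
    return np.zeros((n, 2, 6))

def train_control_network(L_stack, desired_flow, v_prev, iters=100):
    # Placeholder for the online optimization of the recurrent controller
    # (minimizing the predicted flow loss); returns the next 6-DoF velocity.
    return np.zeros(6)

def visual_servo(env, goal_image, eps=1e-2, max_steps=1000):
    """Control-flow sketch of the servoing loop: observe, predict the target
    flow, recompute the interaction matrices from the proxy flow, retrain the
    controller online, and execute only the first predicted velocity."""
    v = np.random.uniform(-0.01, 0.01, size=6)   # random initial velocity V_0
    prev_image = env.get_current_obs()
    for _ in range(max_steps):
        current_image = env.get_current_obs()
        if np.linalg.norm(current_image.astype(float) - goal_image) <= eps:
            break                                # convergence criterion
        desired = predict_optical_flow(current_image, goal_image)   # target flow
        proxy = predict_optical_flow(prev_image, current_image)     # proxy flow
        L_stack = compute_interaction_matrices(proxy)
        v = train_control_network(L_stack, desired.reshape(-1, 2), v)
        env.execute(v)                           # execute V_{t+1}
        prev_image = current_image
    return v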
Experimental Results:
In the present disclosure, an online training process has been described which enables the agent (e.g., a robot) to learn optimal control commands on the go, thus making the controller performance independent of the environment it is deployed in. To validate this, results are shown on photo-realistic 3D benchmark environments as proposed in conventional work. The benchmark comprises 10 indoor photo-realistic environments from the Gibson dataset in the Habitat simulation engine as known in the art. A free-flying RGB camera has been used by the agent so that the agent can navigate in all 6 DoF without any constraint. To compare the method of the present disclosure, the following baselines have been considered:
Photometric visual servoing (e.g., refer ‘C. Collewet and E. Marchand. Photometric visual servoing. IEEE TRO, 27(4):828–834, 2011’): PhotoVS is a learning-free classical visual servoing approach that considers raw image intensities as visual features.
Servonet (e.g., refer ‘A. Saxena, H. Pandya, G. Kumar, A. Gaud, and K. M. Krishna. Exploring convolutional networks for end-to-end visual servoing. In IEEE ICRA, pages 3817–3823. IEEE, 2017’) is a supervised deep learning approach that attempts to predict the relative camera pose from a given image pair. This approach is benchmarked without any retraining/fine-tuning.
DFVS (e.g., refer ‘Y. V. S. Harish, H. Pandya, A. Gaud, S. Terupally, S. Shankar, and K. M. Krishna. Dfvs: Deep flow guided scene agnostic image based visual servoing. In IEEE ICRA, pages 3817–3823. IEEE, 2020’): similar to the method of the present disclosure, DFVS employs deep flow representations; however, it uses the classical IBVS controller. To evaluate the performance of the present disclosure's controller in isolation from the flow network and the predictive model, comparison is also performed against the baselines (a) a fully connected neural network and (b) stochastic optimization (e.g., refer ‘L.-Y. Deng. The cross-entropy method: a unified approach to combinatorial optimization, monte-carlo simulation, and machine learning, 2006.’), known as the cross-entropy method.
Simulation results on Benchmark:
The present disclosure reports quantitative results on the benchmark. As can be seen from Table 1 below, the controller of the present disclosure outperforms the current state-of-the-art image based visual servoing technique (e.g., refer ‘Y. V. S. Harish, H. Pandya, A. Gaud, S. Terupally, S. Shankar, and K. M. Krishna. Dfvs: Deep flow guided scene agnostic image based visual servoing. In IEEE ICRA, pages 3817–3823. IEEE, 2020’) in the total number of iterations and the time taken to reach the desired goal, while achieving marginally superior pose error. Conventional approaches such as PhotoVS and Servonet were unable to converge in 5 and 6 out of 10 scenes in the benchmark, respectively.
Table 1
Approach / Method | T. Error (meters) | R. Error (degrees) | Tj. Length | Iterations
Initial Pose Error | 1.6566 | 21.4166 | - | -
DFVS (prior art) | 0.0322 | 1.7088 | 1.7277 | 1009.2222
NN + MPC (method of the present disclosure) | 0.102 | 2.62 | 1.16* | 344*
CEM (prior art) + MPC | 0.0513 | 0.8644 | 0.89 | 859.6667
LSTM + MPC (method of the present disclosure) | 0.0296 | 0.5977 | 1.081 | 557.4444

It can also be seen in Table 1, from the comparison with baseline (a), that the fully connected neural network gets stuck in local minima and results in a larger pose error. Furthermore, CEM results in shorter trajectories but takes a significantly larger amount of time to converge as compared to the method of the present disclosure, which is intuitive since conventional sampling approaches are inefficient but statistically complete. An “Nvidia GeForce GTX 1080-Ti Pascal” GPU was used by the system and method of the present disclosure during experiments to benchmark these approaches. For an MPC iteration, the method of the present disclosure takes 0.8 sec, which makes it suitable for deploying in real-time.
Controller Performance and Trajectory Comparison:
The present disclosure next compares the trajectories taken by other conventional visual servoing methods, such as DFVS and PhotoVS, and the controller performance with a standard feed-forward neural network (NN) and CEM (prior art), to reach the desired goal (or goal image in this case). The photometric convergence and trajectory plots are shown in FIGS. 4A through 4C. More specifically, FIG. 4A, with reference to FIGS. 1 through 3, depicts a graphical representation illustrating a performance comparison of the method of the present disclosure and conventional approaches with reference to Photometric Error and Number of Iterations, in accordance with an embodiment of the present disclosure. More specifically, FIG. 4A depicts a graphical representation of Photometric Error versus Number of Iterations, wherein the method of the present disclosure [MPC + LSTM] converges in the least number of iterations compared to other conventional approaches. It can be observed from FIG. 4A that the photometric error steadily reduces to less than 500 much faster than with the other methods.
FIG. 4B, with reference to FIGS. 1 through 4A, depicts a graphical representation illustrating a performance comparison of the method of the present disclosure and conventional approaches with reference to camera trajectories, in accordance with an embodiment of the present disclosure. As seen from the trajectory plots depicted in FIG. 4B, CEM with the predictive model tends to reach the goal along an optimal path, followed by the method of the present disclosure, while the method of the present disclosure is much faster in converging to a photometric error of less than 500 as compared to the others.
FIG. 4C, with reference to FIGS. 1 through 4B, depicts a graphical representation illustrating a performance comparison of the method of the present disclosure and conventional approaches with reference to Photometric Error and Number of Iterations with different lambda values being plotted, in accordance with an embodiment of the present disclosure. More specifically, FIG. 4C depicts Photometric Error versus Number of Iterations, plotting different lambda values from DFVS, which shows that the method of the present disclosure converges much faster than the state-of-the-art methods/approaches.
Controller performance with Unsupervised flow:
To showcase that the deep MPC architecture of the present disclosure is also capable of generating optimal control commands even when trained in a fully unsupervised manner (i.e., here even the flow network is trained without supervision, using only images), the system and method of the present disclosure use DDFlow (e.g., refer ‘P. Liu, I. King, M. R. Lyu, and J. Xu. Ddflow: Learning optical flow with unlabeled data distillation, 2019.’) instead of Flownet2 (e.g., refer ‘E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks, 2016.’) to generate the predicted optical flow. For this experiment, the flow network as implemented by the system 100 was trained offline on two scenes from the benchmark. The photometric error plots comparing the unsupervised flow network with the pretrained Flownet model of the present disclosure are shown in FIG. 5. More specifically, FIG. 5, with reference to FIGS. 1 through 4C, depicts a graphical representation illustrating performance of the fully unsupervised controller as implemented by the system 100 of FIG. 1, in accordance with an embodiment of the present disclosure.
It can be seen from the graphical representation depicted in FIG. 5 that the controller of the present disclosure successfully converges for both environments when the deep MPC architecture of the system 100 is trained on flow computed using DDFlow, thereby making the pipeline/method of the present disclosure unsupervised. No ground truth or annotations were provided, and the optical flow was learned in an unsupervised manner from unlabeled data alone. In FIG. 5, dashed lines represent the supervised method (Flownet) and bold lines represent the unsupervised one (DDFlow) for obtaining optical flow on two different scenes.
Stability control: A comparison with conventional approach DFVS:
The value of λ (0.01) as reported in the conventional approach DFVS was changed by dividing it by 2 and by 5, and by multiplying it by 2. It was observed that the performance of the controller of the conventional approach DFVS depends heavily on the value of λ, and an unsuitable value hampers the convergence of their pipeline. When the value of λ was divided by 2, it was observed that the pipeline took more time to converge. Dividing the value of λ by 5 made it converge faster, and when it was multiplied by 2 the pipeline failed to converge, as shown towards the right side of FIG. 4C. This is in stark contrast to the method of the present disclosure, where the deep controller learns the velocity scale by itself.
Embodiments of the present disclosure provide systems and methods for generating control commands for navigation by an agent in an environment. More specifically, the system of the present disclosure implements a deep unsupervised model predictive control architecture for visual servoing in 6-DoF. Through experimental results, it has been demonstrated that the method of the present disclosure can generate optimal control commands in any environment (e.g., a continuous state space-based environment) and can adapt and perform well despite an inaccurate understanding of the underlying system dynamics. Its ability to be trained in an online manner makes it easy to adapt to a completely unknown environment (or new or unseen environments).
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined herein and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the present disclosure if they have similar elements that do not differ from the literal language of the embodiments or if they include equivalent elements with insubstantial differences from the literal language of the embodiments described herein.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.

CLAIMS:
1. A processor implemented method for generating a plurality of control commands for navigation by an agent in an environment, comprising:
obtaining, via a first flow network executed by one or more hardware processors, (i) a goal image corresponding to an action space-based environment (202);
iteratively performing, until a predicted optical loss reaches a threshold:
receiving, via the first flow network executed by the one or more hardware processors, a current image corresponding to the action space-based environment and generating a desired optical flow based on the current image and the goal image, wherein a neural network is trained using the desired optical flow to obtain a trained neural network (204);
receiving, via a second flow network executed by the one or more hardware processors, (i) a previous image and (ii) the current image corresponding to the action space-based environment and generating a current scene depth information thereof (206);
generating, via a kinetic model executed by the one or more hardware processors, a predicted optical flow, using the current scene depth information (208);
performing a comparison of the predicted optical flow and the desired optical flow to obtain the predicted optical loss (210); and
generating, via the trained neural network (RLS) executed by the one or more hardware processors, one or more optimal control commands based on (i) the predicted optical loss and (ii) one or more previously executed control commands being initialized (212).

2. The processor implemented method of claim 1, wherein the action space-based environment is at least one of a discrete action space environment and a continuous action space environment.

3. The processor implemented method of claim 1, wherein the step of generating the current scene depth information thereof is preceded by:
generating a proxy flow using (i) the previous image and (ii) the current image corresponding to the action space-based environment; and
converting the proxy flow to the current scene depth information.

4. The processor implemented method of claim 1, wherein the predicted optical loss is indicative of a difference between the predicted optical flow and the desired optical flow.

5. The processor implemented method of claim 1, wherein when the predicted optical loss reaches the threshold, the current image substantially matches the goal image.

6. A system (100) for generating a plurality of control commands for navigation by an agent in an environment, comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
obtain, via a first flow network executed by the one or more hardware processors, (i) a goal image corresponding to an action space-based environment;
iteratively perform, until a predicted optical loss reaches a threshold:
receiving, via the first flow network, a current image corresponding to the action space-based environment and generating a desired optical flow based on the current image and the goal image, wherein a neural network is trained using the desired optical flow to obtain a trained neural network;
receiving, via a second flow network, (i) a previous image and (ii) the current image corresponding to the action space-based environment and generating a current scene depth information thereof;
generating, via a kinetic model, a predicted optical flow, using the current scene depth information;
performing a comparison of the predicted optical flow and the desired optical flow to obtain the predicted optical loss; and
generating, via the trained neural network, one or more optimal control commands based on (i) the predicted optical loss and (ii) one or more previously executed control commands being initialized.

7. The system of claim 6, wherein the action space-based environment is at least one of a discrete action space environment and a continuous action space environment.

8. The system of claim 6, wherein the current scene depth information is generated by:
generating a proxy flow using (i) the previous image and (ii) the current image corresponding to the action space-based environment; and
converting the proxy flow to the current scene depth information.

9. The system of claim 6, wherein the predicted optical loss is indicative of a difference between the predicted optical flow and the desired optical flow.

10. The system of claim 6, wherein when the predicted optical loss reaches the threshold, the current image substantially matches the goal image.

Documents

Application Documents

# Name Date
1 202021049527-STATEMENT OF UNDERTAKING (FORM 3) [12-11-2020(online)].pdf 2020-11-12
2 202021049527-PROVISIONAL SPECIFICATION [12-11-2020(online)].pdf 2020-11-12
3 202021049527-FORM 1 [12-11-2020(online)].pdf 2020-11-12
4 202021049527-DRAWINGS [12-11-2020(online)].pdf 2020-11-12
5 202021049527-DECLARATION OF INVENTORSHIP (FORM 5) [12-11-2020(online)].pdf 2020-11-12
6 202021049527-FORM 3 [22-03-2021(online)].pdf 2021-03-22
7 202021049527-FORM 18 [22-03-2021(online)].pdf 2021-03-22
8 202021049527-ENDORSEMENT BY INVENTORS [22-03-2021(online)].pdf 2021-03-22
9 202021049527-DRAWING [22-03-2021(online)].pdf 2021-03-22
10 202021049527-COMPLETE SPECIFICATION [22-03-2021(online)].pdf 2021-03-22
11 202021049527-Proof of Right [05-05-2021(online)].pdf 2021-05-05
12 202021049527-FORM-26 [18-10-2021(online)].pdf 2021-10-18
13 Abstract1.jpg 2021-10-19
14 202021049527-FER.pdf 2023-09-11
15 202021049527-FER_SER_REPLY [19-12-2023(online)].pdf 2023-12-19
16 202021049527-DRAWING [19-12-2023(online)].pdf 2023-12-19
17 202021049527-COMPLETE SPECIFICATION [19-12-2023(online)].pdf 2023-12-19
18 202021049527-CLAIMS [19-12-2023(online)].pdf 2023-12-19
19 202021049527-ABSTRACT [19-12-2023(online)].pdf 2023-12-19
20 202021049527-PatentCertificate20-06-2024.pdf 2024-06-20
21 202021049527-IntimationOfGrant20-06-2024.pdf 2024-06-20

Search Strategy

1 202021049527E_09-12-2022.pdf

ERegister / Renewals

3rd: 20 Sep 2024 (From 12/11/2022 - To 12/11/2023)
4th: 20 Sep 2024 (From 12/11/2023 - To 12/11/2024)
5th: 20 Sep 2024 (From 12/11/2024 - To 12/11/2025)
6th: 06 Nov 2025 (From 12/11/2025 - To 12/11/2026)