
Systems And Methods For Real Time Visual Servoing Using A Differentiable Model Predictive Control Framework

Abstract: This disclosure relates to the field of visual servoing and, more specifically, to systems and methods for real time visual servoing using a differentiable model predictive control framework. Conventionally, the process of visual servoing is computationally intractable in real time, making it difficult to deploy in real-world scenarios. The method of the present disclosure demonstrates a significant improvement in total servoing time as compared to conventional visual servoing approaches in six degrees of freedom (6-DoF). The present disclosure showcases the efficiency of a control generation architecture that uses a slim neural network for its training process and a differential cross entropy method providing adaptive multivariate gaussian distribution based sampling to generate optimal control in real time without heavily compromising performance. The controller can be trained online, which helps it generalize and adapt well to unseen environments.


Patent Information

Application #
Filing Date
30 September 2021
Publication Number
13/2023
Publication Type
INA
Invention Field
ELECTRONICS
Status
Parent Application

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point, Mumbai 400021, Maharashtra, India

Inventors

1. KUMAR, Gourav
Tata Consultancy Services Limited, Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata – 700160, West Bengal, India
2. BHOWMICK, Brojeshwar
Tata Consultancy Services Limited, Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata – 700160, West Bengal, India
3. KATARA, Pushkal
International Institute of Information Technology, Professor CR Rao Rd, Gachibowli, Hyderabad – 500032, Telangana, India
4. PANDYA, Harit
International Institute of Information Technology, Professor CR Rao Rd, Gachibowli, Hyderabad – 500032, Telangana, India
5. GUPTA, Abhinav
International Institute of Information Technology, Professor CR Rao Rd, Gachibowli, Hyderabad – 500032, Telangana, India
6. SANCHAWALA, AadilMehdi JavidHusen
International Institute of Information Technology, Professor CR Rao Rd, Gachibowli, Hyderabad – 500032, Telangana, India
7. QURESHI, Mohammad Nomaan
International Institute of Information Technology, Professor CR Rao Rd, Gachibowli, Hyderabad – 500032, Telangana, India
8. KRISHNA, Krishnan Madhava
International Institute of Information Technology, Professor CR Rao Rd, Gachibowli, Hyderabad – 500032, Telangana, India
9. YADARTH, Venkata Sai Harish
International Institute of Information Technology, Professor CR Rao Rd, Gachibowli, Hyderabad – 500032, Telangana, India

Specification

FORM 2

THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of invention:
SYSTEMS AND METHODS FOR REAL TIME VISUAL SERVOING USING A DIFFERENTIABLE MODEL PREDICTIVE CONTROL FRAMEWORK

Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India

The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
The present application claims priority from Indian provisional patent application no. 202121044482, filed on September 30, 2021. The entire contents of the aforementioned application are incorporated herein by reference.

TECHNICAL FIELD
The disclosure herein generally relates to the field of visual servoing, and, more particularly, to systems and methods for real time visual servoing using a differentiable model predictive control framework.

BACKGROUND
Visual servoing in robotics refers to the use of feedback information from a vision sensor for navigating a robot to a desired location. This involves generating a set of actions that moves the robot in response to an observation from an image capturing device or the vision sensor, in order to reach a goal configuration in the world. The objective of visual servoing is to minimize a difference between features extracted from a current and a desired image. This objective is achieved by a visual servoing controller which iteratively minimizes an error indicative of the difference between the features extracted from the current and the desired image. Thus, feature extraction and controller design are integral parts of visual servoing approaches. However, attaining precise alignment for unseen environments poses a challenge to existing visual servoing approaches. While a few conventional approaches generalize well to unseen environments and are capable of incorporating dynamic constraints, they are computationally intractable in real time, making them difficult to deploy in real-world scenarios.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a processor implemented method is provided. The method comprising obtaining, via a first flow network executed by one or more hardware processors, (i) a goal image corresponding to a scene in an environment; iteratively performing, until a predicted optical flow loss reaches a pre-defined threshold: receiving, via the first flow network executed by the one or more hardware processors, a current image corresponding to the scene in the environment; generating, via the one or more hardware processors, a target optical flow based on the current image and the goal image; receiving, via a second flow network executed by the one or more hardware processors, (i) a previous image and (ii) the current image corresponding to the scene in the environment and generating a proxy flow depth information thereof, wherein the proxy flow depth information is normalized using a flow depth normalization layer to obtain a normalized flow depth information; generating, via a kinetic model executed by the one or more hardware processors, a set of predicted optical flows, using the normalized flow depth information; comparing each predicted optical flow from the set of predicted optical flows with the target optical flow to obtain a set of predicted optical flow losses; and adaptively generating, via a control generator executed by the one or more hardware processors, one or more optimal control commands by optimizing (i) the set of predicted optical flow losses and (ii) one or more previously executed control commands being initialized, wherein the one or more optimal commands are adaptively generated using a neural network trained using one or more control parameters.
In another aspect, a system is provided. The system comprising a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to obtain, via a first flow network executed by one or more hardware processors, (i) a goal image corresponding to a scene in an environment; iteratively perform, until a predicted optical flow loss reaches a pre-defined threshold: receive, via the first flow network executed by the one or more hardware processors, a current image corresponding to the scene in the environment; generate, via the one or more hardware processors, a target optical flow based on the current image and the goal image; receive, via a second flow network executed by the one or more hardware processors, (i) a previous image and (ii) the current image corresponding to the scene in the environment and generating a proxy flow depth information thereof, wherein the proxy flow depth information is normalized using a flow depth normalization layer to obtain a normalized flow depth information; generate, via a kinetic model executed by the one or more hardware processors, a set of predicted optical flows, using the normalized flow depth information; compare each predicted optical flow from the set of predicted optical flows with the target optical flow to obtain a set of predicted optical flow losses; and adaptively generate, via a control generator executed by the one or more hardware processors, one or more optimal control commands by optimizing (i) the set of predicted optical flow losses and (ii) one or more previously executed control commands being initialized, wherein the one or more optimal commands are adaptively generated using a neural network trained using one or more control parameters.
In yet another aspect, a non-transitory computer readable medium is provided. The non-transitory computer readable medium comprising obtaining, via a first flow network executed by one or more hardware processors, (i) a goal image corresponding to a scene in an environment; iteratively performing, until a predicted optical flow loss reaches a pre-defined threshold: receiving, via the first flow network executed by the one or more hardware processors, a current image corresponding to the scene in the environment; generating, via the one or more hardware processors, a target optical flow based on the current image and the goal image; receiving, via a second flow network executed by the one or more hardware processors, (i) a previous image and (ii) the current image corresponding to the scene in the environment and generating a proxy flow depth information thereof, wherein the proxy flow depth information is normalized using a flow depth normalization layer to obtain a normalized flow depth information; generating, via a kinetic model executed by the one or more hardware processors, a set of predicted optical flows, using the normalized flow depth information; comparing each predicted optical flow from the set of predicted optical flows with the target optical flow to obtain a set of predicted optical flow losses; and adaptively generating, via a control generator executed by the one or more hardware processors, one or more optimal control commands by optimizing (i) the set of predicted optical flow losses and (ii) one or more previously executed control commands being initialized, wherein the one or more optimal commands are adaptively generated using a neural network trained using one or more control parameters.
In accordance with an embodiment of the present disclosure, the one or more control parameters include an optimized mean value of a subset of the velocities corresponding to an optimized set of predicted optical flow losses used to adaptively update one or more parameters of a multivariate gaussian distribution until the predicted optical flow loss reaches the pre-defined threshold.
In accordance with an embodiment of the present disclosure, the adaptively generated one or more optimal control commands are used to perform visual servoing in real time.
In accordance with an embodiment of the present disclosure, when the predicted optical flow loss reaches the pre-defined threshold, the current image matches the goal image.
In accordance with an embodiment of the present disclosure, the one or more control parameters are obtained based on a differential cross entropy method providing an adaptive multivariate gaussian distribution based sampling.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates an exemplary system for real time visual servoing using a differentiable model predictive control framework according to some embodiments of the present disclosure.
FIG. 2 is a functional block diagram for real time visual servoing using a differentiable model predictive control framework according to some embodiments of the present disclosure.
FIG. 3 illustrates an exemplary flow diagram illustrating a method for real time visual servoing using a differentiable model predictive control framework in accordance with some embodiments of the present disclosure.
FIG. 4 is a functional block diagram of an optimal control generation architecture for real time visual servoing using a differentiable model predictive control framework according to some embodiments of the present disclosure.
FIG. 5 provides a comparison of conventional approaches with the method of the present disclosure in terms of servoing frequency for a specific environment captured using an image capturing device, according to some embodiments of the present disclosure.
FIGS. 6A and 6B depict a graphical representation illustrating a performance comparison of the method of the present disclosure and conventional approaches with reference to Photometric Error and time for two different environments, in accordance with an embodiment of the present disclosure.
FIG. 7 provides a photometric error image representation computed between the goal and attained images on termination for six environments in a simulation benchmark comprising the method of the present disclosure and conventional approaches.
FIG. 8A provides a comparison of conventional approaches with the method of present disclosure in terms of optical flow loss and Photometric Error with reference to number of model predicted control (MPC) iterations in accordance with some embodiments of the present disclosure.
FIG. 8B depicts a graphical representation illustrating a performance comparison of the method of the present disclosure and conventional approaches with reference to velocity and number of iterations, in accordance with an embodiment of the present disclosure.
FIGS. 9A and 9B depict a graphical representation illustrating performance comparison of the method of the present disclosure and conventional approaches with reference to camera trajectories, in accordance with an embodiment of the present disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following embodiments described herein.
Visual servoing refers to the use of feedback information from one or more sensors, such as a vision sensor, for navigating an agent (e.g., a robot, drone, unmanned aerial vehicle, and/or the like) from an initial location to a desired location. This involves generating a set of actions that moves the agent in response to an observation from the vision sensor, such as a camera, in order to reach a goal configuration in the world. Minimizing the difference between features extracted from a current and a desired image is the fundamental objective behind most visual servoing problems. This objective is achieved by a visual servoing controller which iteratively minimizes this difference as an error. Thus, feature extraction and controller design are the important modules of visual servoing approaches. Conventionally, hand-crafted features such as points, lines and contours are employed for visual servoing. However, such appearance based features result in inaccurate matching for larger camera transformations. To circumvent this bottleneck, recent data driven visual servoing approaches resort to deep neural features. A few learning based approaches were able to achieve sub-millimeter precision by learning relative camera pose from a pair of images in a supervised fashion. However, these approaches fail to generalize to new environments as they are over-trained on a single environment.
Although a few conventional approaches are trained on multiple environments and generalize to a certain extent, they require supervision and cannot be trained on-the-fly. Recently, a principled approach attempted to combine deep flow features with a classical image based controller to improve generalization to novel environments without any retraining. Since controller design is another crucial aspect of visual servoing, conventionally, image based and position based controllers are used, which take actions greedily and do not consider a long term horizon, thereby leading to problems such as the loss of features from the field of view, getting stuck in local minima and larger trajectory lengths. Thus, to achieve better controller performance, optimal control and model predictive control (MPC) based formulations have been proposed in the visual servoing domain. MPC based formulations of visual servoing provide the advantage of catering to additional constraints such as robot dynamics, field of view and obstacle avoidance. Although conventional MPC based formulations of visual servoing are lucrative, they require accurate feature correspondences and do not scale up to high dimensional features such as images or flows. These challenges make it difficult for conventional MPC based visual servoing approaches to be employed with modern deep features.
There exist several deep reinforcement learning based visual navigation approaches that present a neural path planning framework for visual goal reaching. However, such model free, end-to-end learning based approaches are sample inefficient, thus do not scale to higher dimensional continuous actions and face difficulty in generalizing to new environments. A few conventional single step methods are faster in computing immediate control. However, these approaches do not utilize the advantages of the MPC optimization formulation. A very recent conventional approach (e.g., refer ‘P. Katara, Y. V. S. Harish, H. Pandya, A. Gupta, A. Sanchawala, G. Kumar, B. Bhowmick, and K. M. Krishna, “Deepmpcvs: Deep model predictive control for visual servoing,” 4th Annual Conference on Robot Learning (CoRL), 2020. [Online]. Available: https://corlconf.github.io/paper 448/’) employs deep flow features and uses a kinematic model based on flow. This conventional approach solves a deep MPC formulation using a recurrent neural network (RNN) based optimization routine that generates velocities to optimize its flow based loss. Although this conventional approach solves the MPC problem on-the-fly in a receding horizon fashion and achieves convergence, it requires training of a recurrent neural network online, which is computationally expensive with respect to total servoing time. This narrows down the scope of performance of the conventional approach in real world scenarios.
Embodiments of the present disclosure provide systems and methods for real time image based visual servoing (IBVS) using a lightweight MPC framework which generates control in real time and can be trained online in an unsupervised fashion. A directional neural network is used to encode the relative pose between a current image and a goal image in its parameters, thereby significantly reducing the number of iterations required in each MPC step. Further, a differential cross entropy method (DCEM) is used for differentiable sampling of control. The controller utilized in the method of the present disclosure attains a desired pose around ten times faster than the recent state of the art approach (e.g., the DeepMPCVS approach) with a 74% reduction in servoing time. This increase in the servoing frequency leads to a significantly shorter delay between successive control commands, leading to a minimal number of jerks, which is crucial for aerial robots. An optimal control generation architecture is used which is sample-efficient and improves the servoing rate without affecting performance. Furthermore, a flow normalization layer is used in the method of the present disclosure which reduces error in flow depth estimates, thereby accounting for inaccurate flow predictions. More specifically, the present disclosure describes the following:
Implementation of a novel, lightweight and real time visual servoing framework formulated as a model predictive control (MPC) optimization process which is trained on-the-fly in an unsupervised fashion. A significant increase in servoing frequency and computation of control in near real time at 0.095 seconds per MPC iteration is achieved.
A control generator network providing a decrease in time using a slim and lightweight neural network is disclosed, which significantly reduces the number of iterations required in each MPC step. A differential cross-entropy method is used that performs adaptive sampling and generates optimal control commands in 6 degrees of freedom (DoF). A flow normalization layer is used, which accounts for inferior flow predictions from the flow network, thereby reducing the network's dependency on the accuracy of the flow.
In other words, the present disclosure provides a lightweight control generation architecture which effectively samples candidate velocities, leading to a significant improvement in the control generation time and hence a reduction in the servoing time. In an embodiment, the candidate velocities are the velocities generated using a kinetic model. The methods of the present disclosure achieve the right trade-off between the servoing rate and performance (attainment of precise alignments) through an intelligent sampling strategy and a slim neural network architecture trained on-the-fly. The depth of a scene (scene depth) captured by an agent using an image capturing device is computed. The scene depth Z_t is computed as an inverted scale representation of the magnitude of an optical flow between a current image I_t and a previous image I_(t-1) captured using the image capturing device, and is passed through a flow-depth normalization layer to generate an effective scene depth from the optical flow.
Referring now to the drawings, and more particularly to FIGS. 1 through 9B, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates an exemplary system 100 for real time visual servoing using a differentiable model predictive control framework according to some embodiments of the present disclosure. In an embodiment, the system 100 includes one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106 (also referred as interface(s)), and one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104. The one or more processors 104 may be one or more software processing components and/or hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is/are configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like, and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 108 is comprised in the memory 102, wherein the database 108 comprises the current image, the previous image, and the goal image corresponding to a scene in an environment. The database 108 further stores information on the scene in the environment.
The database 108 further stores information on the target optical flow, the proxy optical flow, the proxy flow depth information, the normalized flow depth information, the predicted optical flow loss, and the control commands. The information stored in the database 108 further comprises one or more previously executed control commands being initialized.
The database 108 further comprises one or more networks such as one or more flow networks, one or more neural network(s) which when invoked and executed perform corresponding steps/actions as per the requirement by the system 100 to perform the methodologies described herein. The memory 102 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure. In other words, input(s) fed at each step and output(s) generated at each step are comprised in the memory 102 and can be utilized in further processing and analysis.
FIG. 2, with reference to FIG. 1, depicts a block diagram of an architecture as implemented by the system 100 of FIG. 1 for real time visual servoing using a differentiable model predictive control framework, in accordance with an embodiment of the present disclosure.
FIG. 3, with reference to FIGS. 1-2, depicts an exemplary flow chart illustrating a method 200 for real time visual servoing using a differentiable model predictive control framework, using the system 100 of FIG. 1, in accordance with an embodiment of the present disclosure.
Referring to FIG. 4, in an embodiment, the system(s) 100 comprises one or more data storage devices or the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104. The steps of the method 200 of the present disclosure will now be explained with reference to components of the system 100 of FIG. 1, the block diagram of FIG. 2, the flow diagram as depicted in FIG. 3 and the block diagram of FIG. 4. In an embodiment, at step 202 of the present disclosure, the one or more hardware processors 104 obtain, via a first flow network executed by the one or more hardware processors, (i) a goal image corresponding to a scene in an environment. As depicted in the block diagram of FIG. 2, the first flow network (e.g., refer to the flow network that is in the first row of the block diagram) receives the goal image (I^*).
In an embodiment, at step 204 of the present disclosure, the one or more hardware processors 104 receive, via the first flow network executed by the one or more hardware processors, a current image corresponding to the scene in the environment. As depicted in the block diagram of FIG. 2, the first flow network (e.g., refer to the flow network that is in the first row of the block diagram) receives the current image (I_t) at each iteration using a sensor (e.g., a visual sensor) or an image capturing device (e.g., a monocular camera) comprised in the agent (e.g., a robot, an unmanned aerial vehicle (UAV), and/or the like).
In an embodiment, at step 206 of the present disclosure, the one or more hardware processors 104 generate a target optical flow based on the current image and the goal image. In an embodiment, at step 208 of the present disclosure, the one or more hardware processors 104 receive, via a second flow network executed by the one or more hardware processors, (i) a previous image and (ii) the current image corresponding to the scene in the environment and generate proxy flow depth information thereof. As depicted in the block diagram of FIG. 2, the second flow network (e.g., refer to the flow network that is in the third row of the block diagram) receives the current image (I_t) and the previous image (I_(t-1)) as input at each iteration and generates the proxy flow depth information. In an embodiment, the proxy flow depth information is normalized using a flow depth normalization layer to obtain normalized flow depth information.
The steps 202 till 208 are better understood by way of the following description provided as exemplary explanation.
In the present disclosure, the problem of visual servoing is considered as target driven navigation in unseen environments using an image capturing device (e.g., a monocular camera). Given I_t, a current observation of an agent (e.g., a robot, an unmanned aerial vehicle (UAV), and the like) in the form of a monocular Red, Green, Blue (RGB) image at any time instant t, and a desired observation I^*, the agent comprises one or more of one or more actuators, one or more sensors and the like for performing one or more tasks such as manipulation, navigation, and the like. The one or more actuators, one or more sensors and the like may be either an integral part of the agent or externally connected to the agent via one or more I/O interfaces as known in the art. The goal is to generate optimal control commands [v_t, ω_t] in 6-DoF that minimize a photometric error e_t = ‖I_t - I^*‖ (also referred to as the predicted optical flow loss or flow loss and interchangeably used herein) between I^* and I_t. It is assumed by the system and method of the present disclosure that the image capturing device is attached to the agent (also known as an eye-in-hand configuration) and is calibrated.
The present disclosure employs optical flow as an intermediate visual representation for encoding differences in images (e.g., differences between the goal image and the current image serving as the predicted optical flow loss/flow loss as depicted in FIG. 2). Optical flow encodes a displacement for every pixel, which is more relevant information for feature tracking and matching as compared to pixel intensities. As a result, optical flow has been successfully used in motion estimation and visual servoing. Furthermore, in the presence of translational camera motion in a static environment, dense flow features can also be used for estimating image depth. Although the present disclosure does not constrain itself to any specific network for flow or depth estimation, a pre-trained neural network, FlowNet 2 (e.g., refer ‘E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks, 2016.’) is implemented by the system 100 for flow estimation without any fine-tuning.
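By way of a non-limiting illustrative sketch (in Python/PyTorch, not forming part of the claimed subject matter), the two flow computations described above may be organized as follows. The flow_net callable is an assumption standing in for a pretrained FlowNet-style network used without fine-tuning; its exact loading and input formatting are not specified in the present disclosure.

    import torch

    def compute_flows(flow_net, I_prev, I_curr, I_goal):
        """Compute the two optical flows used by the framework.

        flow_net : assumed callable taking two (3, H, W) image tensors and
                   returning a (2, H, W) flow field (e.g., a pretrained
                   FlowNet-style network used without fine-tuning).
        Returns the target flow F(I_t, I*) and the proxy flow between
        I_{t-1} and I_t, whose magnitude serves as a proxy for scene depth.
        """
        with torch.no_grad():                       # the flow network is frozen
            target_flow = flow_net(I_curr, I_goal)  # drives the MPC objective
            proxy_flow = flow_net(I_prev, I_curr)   # used for flow-depth estimation
        return target_flow, proxy_flow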
Referring to the steps of FIG. 3 of the present disclosure, at step 210, the one or more hardware processors 104 generate, via a kinetic model (e.g., also referred to as an interaction matrix as depicted in FIG. 2) executed by the one or more hardware processors, a set of predicted optical flows using the normalized flow depth information. The step 210 can be better understood by way of the following description provided as exemplary explanation.
In order to solve the fundamental IBVS objective of minimizing the photometric error e_t = ‖I_t - I^*‖ between I^* and I_t, the kinetic model is generated and an MPC objective is solved. The MPC objective is given as:

V_t^* = \arg\min_{V_t} \| F(I_t, I^*) - \hat{F}(V_t) \|    (1)

where F(I_t, I^*) is the target optical flow between I^* and I_t and \hat{F}(V_t) is a pseudo-flow generated through the predictive model given by:

\hat{F}(V_{t+1:t+T}) = \sum_{k=1}^{T} [ L(Z_t) \, V_{t+k} ]    (2)

where L(Z_t) is the interaction matrix which relates the rate of change of features in the camera plane to the image plane, and V_t is the optimal control. With x, y being the image coordinates and Z_t being the depth information of the scene, the interaction matrix is generated as:

L(Z_t) = \begin{bmatrix} -1/Z_t & 0 & x/Z_t & xy & -(1+x^2) & y \\ 0 & -1/Z_t & y/Z_t & 1+y^2 & -xy & -x \end{bmatrix}    (3)
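A minimal Python/PyTorch sketch of equation (3) is given below for illustration; it assembles the per-pixel interaction matrix from image coordinates x, y (assumed to be normalized camera-plane coordinates, a standard IBVS assumption) and a per-pixel depth proxy. The function names and tensor shapes are illustrative assumptions, not part of the disclosure.

    import torch

    def interaction_matrix(x, y, Z):
        """Build the IBVS interaction matrix of equation (3).

        x, y : (N,) normalized image coordinates of the sampled pixels
        Z    : (N,) per-pixel depth proxy (e.g., the normalized flow depth)
        Returns a (N, 2, 6) tensor so that flow = L @ V for a 6-DoF velocity V.
        """
        zeros = torch.zeros_like(x)
        row_u = torch.stack([-1.0 / Z, zeros, x / Z, x * y, -(1 + x ** 2), y], dim=-1)
        row_v = torch.stack([zeros, -1.0 / Z, y / Z, 1 + y ** 2, -x * y, -x], dim=-1)
        return torch.stack([row_u, row_v], dim=-2)   # (N, 2, 6)

    def predicted_flow(L, V):
        """Pseudo-flow for a 6-DoF velocity V of shape (6,), per equations (2)/(5)."""
        return torch.einsum('nij,j->ni', L, V)       # (N, 2) flow at each pixel

Stacking the per-pixel 2x6 matrices allows the pseudo-flow of equations (2) and (5) to be obtained by a single batched matrix-vector product.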
In an embodiment, inferior flow predictions are obtained using conventional two-view depth estimation methods, since FlowNetC is used as the first and the second flow network without any retraining/fine-tuning. In another conventional approach, the magnitude of the predicted flow values has high variance from pixel to pixel due to the non-planar structure of the scene and, as a result, there are heavy outliers while using the predicted flow as a proxy for depth. These outliers affect the performance of the controller of the present disclosure, because the magnitude of the velocity is very sensitive to depth values. Hence, a stable flow depth is required. A sigmoid function as provided in the equation below is used to scale and normalize the flow values before being fed to the kinetic model.

Z_s(x, y) = \nu \left( \frac{1}{1 + e^{-Z}} - 0.5 \right)    (4)

Here, Z is the two-view depth used in conventional methods, ν is a scaling factor which is selected as 0.4, and Z_s is the scaled inverse used as a proxy for depth in the interaction matrix. Table 1 provided below illustrates the effect of flow normalization on the mean squared error for three scenes. As can be seen in Table 1, a significant decrease in mean squared error is observed when the flow depth is normalized. An illustrative sketch of this normalization layer is provided after Table 1.

Table 1
Scene    | MSE (Flow depth) | MSE (Normalized flow depth)
Quantico | 160.04           | 49.72
Arkansaw | 469.72           | 63.44
Ballou   | 508.81           | 122.78
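An illustrative sketch of the flow-depth normalization layer of equation (4) is provided below, assuming (as described above) that the magnitude of the optical flow between I_(t-1) and I_t is used as the raw depth proxy Z; the value ν = 0.4 follows the text, while the function and variable names are illustrative assumptions.

    import torch

    def flow_depth_normalization(proxy_flow, nu=0.4):
        """Flow-depth normalization layer of equation (4).

        proxy_flow : (2, H, W) optical flow between I_{t-1} and I_t
        Returns Z_s, a bounded, outlier-suppressed quantity that, per the text,
        is used as the proxy for depth in the interaction matrix L(Z_t).
        """
        Z = torch.linalg.norm(proxy_flow, dim=0)      # per-pixel flow magnitude
        Z_s = nu * (torch.sigmoid(Z) - 0.5)           # scale and squash heavy outliers
        return Z_s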
In an embodiment, at step 212 of the present disclosure, the one or more hardware processors 104 compare each predicted optical flow from the set of predicted optical flows with the target optical flow to obtain a set of predicted optical flow losses. Each predicted optical flow loss indicates a difference between the predicted optical flow and the target optical flow, in one embodiment of the present disclosure.
In an embodiment, at step 214 of the present disclosure, the one or more hardware processors 104 are configured to adaptively generate, via a control generator executed by the one or more hardware processors, one or more optimal control commands by optimizing (i) the set of predicted optical flow losses and (ii) one or more previously executed control commands being initialized. In an embodiment, the one or more optimal commands are generated using a neural network trained using one or more control parameters. In an embodiment, the one or more control parameters are obtained based on a differential cross entropy method providing adaptive multivariate gaussian distribution based sampling. The one or more optimal control commands are then executed (e.g., executed either by the agent or any external or internal computing device connected to the agent) for manipulation of the agent or for navigation of the agent in the environment. In an embodiment, the system 100 performing the above steps of the method depicted in FIGS. 2 and 3 may be either an integral part of the agent (e.g., drone, robot, UAV, and the like) or externally attached/communicatively connected to the agent via one or more I/O interfaces (e.g., input/output interfaces as known in the art). The optimal control commands comprise, but are not limited to, the direction in which the agent should move, action(s) that the agent needs to perform, and the like. The steps 212 and 214 can be better understood by way of the following description provided as exemplary explanation.
The predictive model shown in equation (2) is modified in the present disclosure to achieve an optimal trade-off between speed and accuracy, by connecting the MPC objective (which is to minimize the error between the predicted optical flow and the target optical flow) with a differential sampling based strategy through a slim and lightweight neural network. A horizon parameter 'h' is used in order to scale the predicted optical flow so that the network learns to predict the mean optical flow, rather than predicting over a time horizon, as shown in equation (5):

\hat{F}(V_t) = h \, L(Z_t) \, V_t    (5)

Thus, a loss function as shown in equation (6) is formulated to train the control generation network on-the-fly:

L_{flow} = \| \hat{F}(\hat{V}_t) - F(I_t, I^*) \| = \| [ L(Z_t) \, \hat{V} ] - F(I_t, I^*) \|    (6)

The pseudo-flow \hat{F}(V_t) is regressed against the target flow F(I_t, I^*) using a mean squared error loss.
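The following sketch illustrates how equations (5) and (6) may be combined into the on-the-fly training loss; the numerical value of the horizon parameter h used here is only an illustrative assumption, as the disclosure does not fix it in this passage.

    import torch
    import torch.nn.functional as F

    def flow_loss(L, V_hat, target_flow, horizon=5.0):
        """Equations (5) and (6): regress the horizon-scaled pseudo-flow onto
        the target flow with a mean squared error loss.

        L           : (N, 2, 6) interaction matrices
        V_hat       : (6,) velocity predicted by the control generator
        target_flow : (N, 2) target flow F(I_t, I*) at the same pixels
        horizon     : the horizon parameter h of equation (5); the value used
                      here is illustrative only.
        """
        pseudo_flow = horizon * torch.einsum('nij,j->ni', L, V_hat)   # h * L(Z_t) V
        return F.mse_loss(pseudo_flow, target_flow)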
FIG. 4, with reference to FIGS. 1-3, depicts a functional block diagram of an optimal control generation architecture for real time visual servoing using a differentiable model predictive control framework according to some embodiments of the present disclosure.
Optimal Control Generation Architecture: In an embodiment, the differential cross entropy method (DCEM) is used to generate optimal control and train the neural network. In an embodiment, the DCEM-based sampling and training of the neural network can be better understood by way of the following description provided as exemplary explanation:
1) Intelligent DCEM-based Sampling: Due to the high dimensionality of flow representations, it becomes difficult for classical MPC solvers to optimize the objective function in equation (1). The conventional cross entropy method is an attractive formulation for solving complex control optimization problems involving objective functions which are non-convex, deterministic and continuous in nature, by solving equation (7):

\hat{F} = \arg\min_{F} \| e_V(F) \|    (7)

where e_V(F) is the objective function having parameters V over the n-dimensional space. However, there are a few shortcomings to using this approach. It is a gradient-free optimization approach where the Gaussian parameters are refitted only in a single optimization step, which does not allow adaptive sampling over consecutive MPC steps, i.e., it is unable to preserve the memory of effective sampling. Moreover, the parameter optimization is not based on the downstream task's objective function, which might lead to a sub-optimal solution and an increase in the number of MPC steps, directly affecting the servoing time. In the present disclosure, the differential cross entropy method (DCEM) is used, which parameterizes the objective function e_V(F), making it differentiable with respect to V. Further, a non-linearity is introduced in the control generation process of the present disclosure with the addition of the neural network, in order to retain information throughout all MPC steps. This connects the sampling strategy with the objective function, making CEM an end-to-end learning process. Hence, the Gaussian parameters are updated with subsequent MPC optimization steps, enabling adaptive sampling over the process based on the MPC objective. This allows the present disclosure to sample from a latent space of more reasonable control solutions.
2) DirectionNN: In the present disclosure, a slim neural network called the 'DirectionNN' is used, whose design aims to encode a sense of the relative pose between the current image I_t and the goal image I^*. The neural network is kept lightweight in order to achieve a high servoing rate and to incorporate online learning of relative pose updates in each MPC optimization step. Since small steps are executed in each iteration, the neural network is capable of learning the change in relative pose quickly. Moreover, the weights of the neural network are retained across MPC iterations and can be reused, since there is minimal change in the agent's image measurements. By multiplying with the horizon h shown in equation (5), the merit of the direction predicted by the DirectionNN over an extended time is checked. Hence, the neural network in the present disclosure is trained for a smaller number of iterations in each MPC step, which significantly decreases the total servoing time. According to an embodiment, the direction neural network architecture has an input layer consisting of 6 neurons, followed by 4 fully connected hidden layers with 16, 32, 256 and 64 neurons respectively, and an output layer with 6 neurons which represents the 6-DoF velocity vector. A ReLU activation is applied at the output of each hidden layer and a Sigmoid activation on the last layer. The 6-D output is further scaled between -1 and 1 to vectorize it as a 6-DoF velocity vector. The DirectionNN is trained in each MPC step. The inputs to the DirectionNN are intelligently sampled from a Gaussian distribution.
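An illustrative PyTorch sketch of the DirectionNN described above is given below; it follows the stated 6-16-32-256-64-6 layer sizes, ReLU activations on the hidden layers and a Sigmoid output rescaled to [-1, 1]. The class name and coding details are illustrative assumptions.

    import torch
    import torch.nn as nn

    class DirectionNN(nn.Module):
        """Slim direction network: a 6-neuron input layer, four fully connected
        hidden layers of 16, 32, 256 and 64 neurons with ReLU activations, and a
        6-neuron Sigmoid output rescaled to [-1, 1] as a 6-DoF velocity direction."""

        def __init__(self):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(6, 16), nn.ReLU(),
                nn.Linear(16, 32), nn.ReLU(),
                nn.Linear(32, 256), nn.ReLU(),
                nn.Linear(256, 64), nn.ReLU(),
                nn.Linear(64, 6), nn.Sigmoid(),
            )

        def forward(self, z):
            # z: (batch, 6) latent samples drawn from the DCEM Gaussian distribution
            return 2.0 * self.body(z) - 1.0   # rescale the Sigmoid output to [-1, 1]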
Sampling and Learning: In the present disclosure, DCEM sampling is used along with the DirectionNN to generate optimal control. In an embodiment, the one or more control parameters used to train the neural network include an optimized mean value of a subset of the velocities corresponding to an optimized set of predicted optical flow losses, used to adaptively update one or more parameters of a multivariate gaussian distribution until the predicted flow loss reaches the pre-defined threshold. This can be further illustrated by way of the following description provided as exemplary explanation with the help of FIG. 4. As shown in FIG. 4, in each MPC optimization step, the control generation network samples a β (batch size) x 6 dimensional vector from a Gaussian distribution g_f(μ, σ²) and carries out a forward pass, generating β samples of 6-DoF velocity commands. A pixel-wise target flow F(I_t, I^*) is computed between I_t and I^* using a pretrained FlowNetC model. Moreover, a kernel with a filter size of (7x7) and a stride of 7 is applied to the target flow F(I_t, I^*) and the pseudo-flow \hat{F}(V_t). The kernel K consists of only one nonzero value, with K[0,0] = 1. The weights of the DirectionNN are updated through gradient descent. An Adam optimizer is used with a learning rate of 0.005 to train the DirectionNN. The sampling parameters of the Gaussian distribution g_f(μ, σ²) are optimized for subsequent steps: μ and σ² are updated with the mean and variance of the top K velocities corresponding to the top K least flow errors. For example, in the method of the present disclosure, 8 velocities are sampled, the flow loss is computed for each, and the velocity corresponding to the least flow loss is selected and applied to the agent. Further, the horizon parameter is multiplied with the generated velocity before computing the flow loss. The MPC optimization steps are repeated until the convergence criterion to reach the goal location is met. A minimal sketch of one such step is provided below.
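The sketch below illustrates one MPC optimization step under stated assumptions: the top-K size and the horizon value are illustrative, the standard deviation (rather than the variance) is carried as the sampling parameter, and the 7x7 stride-7 kernel with its single nonzero entry at K[0,0] is interpreted as sub-sampling every 7th pixel of the flows before they reach this function. The optimizer may be created once per run, e.g., torch.optim.Adam(direction_nn.parameters(), lr=0.005), per the learning rate stated above.

    import torch

    def mpc_step(direction_nn, optimizer, L, target_flow, mu, sigma,
                 beta=8, top_k=4, horizon=5.0):
        """One MPC optimization step: DCEM sampling plus on-the-fly training.

        L           : (N, 2, 6) interaction matrices from the normalized flow depth
        target_flow : (N, 2) sub-sampled target flow F(I_t, I*)
        mu, sigma   : (6,) mean and standard deviation of the sampling Gaussian
        Returns the velocity to execute and the updated Gaussian parameters.
        """
        z = mu + sigma * torch.randn(beta, 6)                  # beta x 6 latent samples
        V = direction_nn(z)                                    # beta candidate 6-DoF velocities
        pseudo = horizon * torch.einsum('nij,bj->bni', L, V)   # beta pseudo-flows, eq. (5)
        losses = ((pseudo - target_flow.unsqueeze(0)) ** 2).mean(dim=(1, 2))  # eq. (6) per sample

        optimizer.zero_grad()
        losses.mean().backward()                               # single training iteration per MPC step
        optimizer.step()

        top = torch.topk(losses.detach(), top_k, largest=False).indices  # least flow losses
        V_top = V.detach()[top]
        mu_new = V_top.mean(dim=0)                             # refit Gaussian mean
        sigma_new = V_top.std(dim=0, unbiased=False)           # refit Gaussian spread
        best_V = V.detach()[losses.argmin()]                   # command applied to the agent
        return best_V, mu_new, sigma_new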
In an embodiment, the adaptively generated one or more optimal control commands are used to perform visual servoing in real time. In an embodiment of the present disclosure, the steps 204 till 214 are iteratively performed until the predicted optical flow loss reaches a threshold (e.g., a dynamically determined threshold, an empirically determined threshold, or a pre-defined threshold). In other words, the steps 204 till 214 are iteratively performed by the system 100 until the current image matches the goal image. Once the predicted optical flow loss reaches the threshold, the current image tends to substantially match the goal image. For instance, assume that the threshold is an x% loss or at least a y% match, wherein the values of 'x' and 'y' can vary depending upon the implementation of the system for a given environment type. The threshold as mentioned and/or expressed in terms of 'x%' and/or 'y%' may also be referred to as a convergence threshold and is interchangeably used herein, in one embodiment of the present disclosure. For the sake of brevity, assume that the threshold is a 5% loss or a 95% matching threshold. In such scenarios, the steps 204 till 214 are iteratively performed until the predicted optical flow loss between each predicted optical flow from the set of predicted optical flows and the target optical flow is reduced to 5%. In other words, the difference in the scene information of the current image when compared with the goal image shall be 5%. Once this is achieved, it is presumed that the agent has more or less reached the goal image or the given goal (or the objective). In another embodiment, the difference in the scene information of the current image when compared with the goal image shall be less than or equal to x% (e.g., ≤5%). This also means that there is at least a 95% match of the scene information of the current image when compared to the scene information of the goal image. In another embodiment, the scene information of the current image when compared with the goal image shall result in a match greater than or equal to y% (e.g., ≥95%). More specifically, the scene information comprised in the current image substantially matches the scene information comprised in the goal image that is specific to the corresponding scene in the environment. It is to be understood by a person having ordinary skill in the art that the values of 'x' and 'y' can take any value depending upon the implementation and environment type, and such values of 'x' and 'y' shall not be construed as limiting the scope of the present disclosure. In an embodiment, once the environment is detected/identified, the threshold may be dynamically determined and configured in real time. In another embodiment, the threshold may be pre-defined and configured by the system 100.
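A minimal illustrative convergence check corresponding to the x% loss / y% match criterion discussed above is sketched below; the normalization of the photometric error and the default threshold are assumptions for illustration only (the experiments later in this disclosure use a strict photometric error criterion instead).

    import torch

    def has_converged(I_curr, I_goal, max_error_fraction=0.05):
        """Illustrative convergence check: stop when the normalized photometric
        error between the current and goal images drops to at most the threshold
        (e.g., a 5% error / 95% match criterion)."""
        error = torch.linalg.norm(I_curr - I_goal)
        return bool(error / torch.linalg.norm(I_goal) <= max_error_fraction)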
The entire approach/method of the present disclosure can be better understood by way of the following pseudo code provided as an example:
Require: I^*, ε  // goal image, convergence threshold
1. Initialize μ, σ²  // initialize the Gaussian distribution sampling parameters
2. while ‖I_t - I^*‖ ≥ ε do  // convergence criterion
3.   I_t ← get-current-obs()  // obtain the current RGB observation from the sensor
4.   Predict target flow F(I_t, I^*)  // predict the target (optical) flow using the flow network
5.   L_t := compute-interaction-matrix(F(I_t, I_(t-1)))  // kinetic model generation
6.   for m = 0 → M do  // on-the-fly training of the DirectionNN
7.     [β_i]_(i=1)^N ~ g_f(μ, σ²)  // sample β x 6 vectors from the Gaussian distribution
8.     [V̂_(t,i)]_(i=1)^N = f_(θ_m)([β_i]_(i=1)^N)  // predict β velocities from the DirectionNN
9.     [F̂(V_(t,i))]_(i=1)^N = [L_t(Z_t) V_(t,i)]_(i=1)^N  // generate β pseudo-flows using the predictive model
10.    [L_flow,i]_(i=1)^N ← [‖F̂(V_(t,i)) - F(I_t, I^*)‖]_(i=1)^N  // compute the predicted optical flow loss
11.    θ_(m+1) ← θ_m - η ∇_θ L_flow  // update the DirectionNN parameters
12.    μ_(t+1) = (1/k) Σ_(i=1)^k V_(t,i)  // update the mean of the Gaussian distribution using the top-k velocities
13.    σ²_(t+1) = (1/k) Σ_(i=1)^k (V_(t,i) - μ_(t+1))²  // update the variance of the Gaussian distribution
14.  end for
15.  Execute the control command V̂_(t+1) in the environment
16. end while
Experimental Results:
FIG. 5 provides a comparison of conventional approaches with the method of the present disclosure in terms of servoing frequency for a specific environment captured using an image capturing device, according to some embodiments of the present disclosure. As depicted in FIG. 5, the method of the present disclosure (RTVS) shows a significant improvement in servoing frequency when compared with conventional deep MPC based visual servoing approaches in 6-DoF, without compromising its ability to attain precise alignments. The controller of the present disclosure generates optimal control in real time at a frequency of 66 Hz (excluding optical flow overheads) and successfully servos to its destination, while conventional approaches still lag behind.
In the present disclosure, an online control generation strategy has been described that can generate optimal 6-DoF robot commands on-the-fly and in real time, while not compromising on performance in terms of convergence and alignment. The benchmark used in the method of the present disclosure comprises 8 indoor 3D photorealistic baseline environments from the Gibson dataset in the Habitat simulation engine as known in the art. These baseline environments span various levels of difficulty based on parameters such as the extent of overlap, the amount of texture present and the rotational and translational complexities between the initial and the desired image. A free-flying RGB camera has been used by the agent so that the agent can navigate in all 6 DoF without any constraint.
To validate this, the present disclosure reports quantitative and qualitative results on the benchmark. In the present disclosure, control is generated at 0.015 seconds per MPC step (excluding flow overheads) and photometric convergence is attained faster than with the conventional methods which optimize over a time horizon. While achieving a significant improvement in servoing rate, the method of the present disclosure does not compromise on performance in terms of pose and photometric errors, which is comparable to established baselines. Although the method of the present disclosure performs only marginally worse in terms of trajectory length, it generates control in real time and achieves precise alignments, thereby striking the right trade-off between speed and performance.
The method of the present disclosure is compared with the current state-of-the-art DeepMPCVS and MPC+NN based techniques (e.g., refer ‘P. Katara, Y. V. S. Harish, H. Pandya, A. Gupta, A. Sanchawala, G. Kumar, B. Bhowmick, and K. M. Krishna, “Deepmpcvs: Deep model predictive control for visual servoing,” 4th Annual Conference on Robot Learning (CoRL), 2020. [Online]. Available: https://corlconf.github.io/paper 448/’), both of which are MPC formulations but are computationally heavy due to their bulky architectures. Conventional single-step approaches (e.g., Deep flow guided scene agnostic image based visual servoing (DFVS)) are fast in computing immediate control, but they optimize greedily and cannot incorporate additional constraints. However, high servoing rates can be achieved with such single-step approaches as well.
Convergence Study and Time Comparison
Table 2 provides a quantitative comparison of the present disclosure with conventional deep MPC based methods and a conventional single-step approach. It can be seen from Table 2 that, except for the MPC+NN based method (prior art), all other methods attain convergence on the simulation benchmark. The method of the present disclosure reports the time per servoing iteration excluding overheads (referred to as Time w.o. flow in Table 2), the time per servoing iteration including flow computations (referred to as Time w. flow in Table 2), the average number of iterations taken to reach convergence (referred to as Total Iters. in Table 2) and the average time required to servo to the destination, including overheads (referred to as Total Time in Table 2).
Table 2
Approaches                           | Time w.o. flow (s) | Time w. flow (s) | Total Iters. | Total Time (s)
MPC+NN (Prior art)                   | 0.75               | 1.10             | 344.22       | 378*
DFVS (Prior art)                     | 0.001              | 0.21             | 994.88       | 209
DeepMPCVS (Prior art)                | 0.8                | 1.15             | 569.63       | 655
RTVS (Method of present disclosure)  | 0.015              | 0.095            | 1751.25      | 166
It can be seen from Table 2 that the method of the present disclosure outperforms the current state of the art approach (e.g., DeepMPCVS), which is optimal and requires the least number of iterations but has a costly per-iteration time overhead. In contrast, the method of the present disclosure has a very low per-MPC-step compute time and takes the least amount of time to attain convergence, comparable to the performance of state of the art methods such as DFVS. In Table 2, all times are given in seconds and * indicates no convergence in some scenes.
The method of the present disclosure is compared across all environments in the simulation benchmark, and the per MPC-step time (per IBVS-step time in the case of single step conventional approaches) with and without overheads from flow computation, the number of steps taken for convergence and the total time for a visual servoing run to reach the goal image, averaged over all scenes, are reported as shown in Table 2. The lightweight control generation architecture and intelligent sampling strategy, along with the MPC formulation of the present disclosure, help achieve a per MPC-step time of 0.015 seconds (this is the time taken for the MPC optimization step, excluding the overheads from the flow network). In the method of the present disclosure, training is performed only for 1 iteration in every MPC step (as opposed to the 100 required in the DeepMPCVS based state of the art method) and the weights of the neural network are retained. Since there is no substantial change in the velocities for immediate MPC steps, this helps lower the per MPC-step time. The entire time taken for an MPC step, after including flow overheads, is 0.095 seconds, resulting in near real time control generation at a frequency of 10.52 Hz. The method of the present disclosure attains strict convergence of photometric error <500 and outperforms conventional deep MPC based visual servoing approaches in terms of the total servoing time and servoing rate. The photometric convergence plots are shown in FIGS. 6A and 6B. More specifically, FIGS. 6A and 6B, with reference to FIGS. 1 through 5, depict a graphical representation illustrating a performance comparison of the method of the present disclosure and conventional approaches with reference to photometric error and time for two different environments, in accordance with an embodiment of the present disclosure. It is observed from FIGS. 6A and 6B that the method of the present disclosure is the fastest to achieve convergence, without any compromise on the convergence criteria. An "Nvidia Geforce GTX 1080-Ti Pascal" GPU was used by the system and method of the present disclosure during experiments to benchmark these approaches.
Qualitative Results
The present disclosure achieves a significant improvement in time while not compromising on photometric convergence. The method of the present disclosure is tested across all environments in the simulation benchmark. The controller of the present disclosure successfully achieves convergence and is able to servo to the desired location with optimal control commands. The control actions are learnt without supervision and the network is trained on-the-fly in an unsupervised fashion. FIG. 7 provides a photometric error image representation computed between the goal and attained images on termination for six environments in a simulation benchmark comprising the method of the present disclosure and conventional approaches. The controller of the present disclosure successfully achieves convergence with a strict photometric error of <500 across all environments. It can be seen from FIG. 7 that conventional approaches such as PhotoVS and ServoNet fail to converge even with a large number of iterations, thus showing large photometric errors. Although the DFVS based approach successfully converges, it is a single-step approach that does not consider a long term horizon. The current state of the art method (e.g., DeepMPCVS) meets the photometric convergence criteria, but is extremely slow since the online training of its RNN architecture is computationally expensive. The method of the present disclosure achieves strict convergence with control generated in real time at a frequency of 66 Hz (excluding flow overheads), whilst optimizing over a receding horizon. It is shown in FIG. 7 that a photometric error of strictly <500 is attained by the method of the present disclosure in all scenes, and the photometric error representation computed between the attained and the desired image is reported.
Pose Error and Trajectory Lengths
The present disclosure next performs more quantitative tests. Table 3 provides a comparison of the average performance of the method of the present disclosure and different conventional approaches in terms of pose error and trajectory length across all environments in the benchmark. As shown in Table 3, the initial pose error, translation error (referred to as T. Error in Table 3), rotational error (referred to as R. Error in Table 3) and trajectory length (referred to as Traj. Length in Table 3), averaged over all environments across the simulation benchmark, are reported. It can be seen from Table 3 that very low pose errors are achieved using the method of the present disclosure, which are comparable to other conventional 6DoF servoing approaches.
Table 3
Approaches                               T. Error (m)   R. Error (deg.)   Traj. Length (m)
Initial Pose Error                       1.6566         21.4166           -
MPC+NN (Prior art)                       0.1020         2.6200            1.1600
DeepMPCVS (Prior art)                    0.1200         0.5500            1.1800
RTVS (Method of present disclosure)      0.0200         0.5900            1.732
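By way of non-limiting illustration, the translation error, rotational error and trajectory length reported in Table 3 may be computed from recorded camera poses as in the following Python sketch; the use of 4x4 homogeneous pose matrices and a geodesic rotation distance is an assumption made purely for illustration.

import numpy as np

def translation_error(T_final, T_goal):
    """Euclidean distance between final and goal camera positions (metres)."""
    return float(np.linalg.norm(T_final[:3, 3] - T_goal[:3, 3]))

def rotation_error_deg(T_final, T_goal):
    """Geodesic angle between final and goal camera orientations (degrees)."""
    R_rel = T_goal[:3, :3].T @ T_final[:3, :3]
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def trajectory_length(positions):
    """Sum of straight-line segment lengths along the recorded camera path."""
    positions = np.asarray(positions, dtype=np.float64)
    return float(np.linalg.norm(np.diff(positions, axis=0), axis=1).sum())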
FIG. 8A provides a comparison of conventional approaches with the method of the present disclosure in terms of optical flow loss and photometric error with reference to the number of model predictive control (MPC) iterations, in accordance with some embodiments of the present disclosure. It can be seen from FIG. 8A that all the relevant errors are reduced concurrently. The drop in the optical flow loss indicates that the network of the present disclosure predicts flows more accurately as the number of MPC iterations increases. Further, the drop in the flow errors and photometric errors indicates that the agent is able to reach near the goal. Thus, the method of the present disclosure achieves precise alignment with a high servoing frequency and captures a strong correlation between the photometric error and the optical flow loss, as shown in FIG. 8A. FIG. 8B depicts a graphical representation illustrating a performance comparison of the method of the present disclosure and conventional approaches with reference to velocity and number of iterations, in accordance with an embodiment of the present disclosure. It is shown in FIG. 8B that the magnitude of the velocity reduces as the agent approaches the goal (here, the goal image), resulting in a smooth and stable convergence. Thus, the method of the present disclosure achieves a steady decrease in the velocities, as shown in FIG. 8B.
Further, a comparison of the camera trajectories taken by the agent in the other conventional multi-step-ahead approaches and the method of the present disclosure is provided. FIGS. 9A and 9B depict a graphical representation illustrating a performance comparison of the method of the present disclosure and conventional approaches with reference to the camera trajectories, in accordance with an embodiment of the present disclosure. The grid size considered in the graphical representations is 0.2 m × 0.2 m × 0.2 m. It is observed from FIGS. 9A and 9B that the trajectory lengths are slightly inferior to those of the current state of the art method (e.g., DeepMPCVS) but shorter than those of single-step conventional approaches, which signifies the importance of the control optimization steps executed in the method of the present disclosure. Further, by making a slight compromise on the trajectory length, a superior servoing frequency is achieved as compared to time-consuming approaches such as the current state of the art method (e.g., DeepMPCVS). Thus, the method of the present disclosure achieves the right trade-off between servoing rate and performance. It is observed from FIG. 9B that the lightweight controller of the present disclosure generates optimal velocity commands and is able to servo to the desired location in the presence of FDN (flow depth normalization), which effectively handles inaccuracies in the flow predictions.
Generalization to Real-World Scenarios
The method of the present disclosure is capable of being deployed on real-life drones. In order to simulate a real-world setup, tests with actuation noise are performed, since robot commands are generally noisy in a real-world setup. To simulate such conditions, the method of the present disclosure adds Gaussian noise with mean μ = 0 and standard deviation σ = 0.1 (m/s for the translational and rad/s for the rotational components) to the control commands in Habitat in all six degrees of freedom, before applying them to the agent. The method of the present disclosure adapts well and converges in an average of 1850.22 MPC steps, which is 4.28% more than those required in a noiseless setup, as depicted in Table 4 below. A non-limiting sketch of this noise injection is provided after Table 4.
Table 4
Approaches                                             Time per MPC-step (s)   MPC-steps   Total Time (s)
RTVS [Noiseless] (Method of present disclosure)        0.095                   1774.28     168.53
RTVS [Induced Noise] (Method of present disclosure)    0.095                   1850.22     175.77
It is observed from Table 4 that convergence is achieved even in the presence of actuation noise, thereby demonstrating the ability of the controller of the present disclosure to adapt to a real world setup. Here, all times are in seconds. The method of the present disclosure is thus capable of handling actuation noise and generalizing in real-world experiments.
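By way of non-limiting illustration, the actuation noise described above may be injected as in the following Python sketch; the apply_velocity callable is a hypothetical agent interface used only for this sketch.

import numpy as np

def apply_with_actuation_noise(velocity_6dof, apply_velocity, sigma=0.1, rng=None):
    """Add zero-mean Gaussian noise (sigma = 0.1, in m/s for the translational
    components and rad/s for the rotational components) to a 6-DoF command
    before actuating the agent, simulating noisy real-world control."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.asarray(velocity_6dof, dtype=np.float64) + rng.normal(0.0, sigma, size=6)
    apply_velocity(noisy)
    return noisy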
Evaluating Flow Depth Normalization
Since a flow network (e.g., FlowNetC) with a lightweight architecture is used in the present disclosure without any retraining or fine-tuning, the method is vulnerable to inaccuracies from the flow network. This issue is resolved by using the flow depth normalization layer in the method of the present disclosure. As depicted in FIG. 9B, the architecture proposed in the present disclosure is unable to achieve convergence and reach its destination when the flow depth is not normalized. With the use of the flow depth normalization layer, the flow is stabilized and robust performance in both the servoing rate and accuracy is achieved.
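A minimal, non-limiting sketch of a flow depth normalization step is given below; the scaling by the mean flow-depth magnitude shown here is an assumption made purely for illustration, and the actual normalization performed by the flow depth normalization layer is as described earlier in this specification.

import numpy as np

def normalize_flow_depth(proxy_flow_depth, eps=1e-8):
    """Illustrative normalization of a proxy flow-depth map: scale by its
    mean magnitude so that downstream flow predictions are not dominated
    by the raw (unscaled) depth values."""
    d = np.asarray(proxy_flow_depth, dtype=np.float64)
    return d / (np.abs(d).mean() + eps)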
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined herein and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the present disclosure if they have similar elements that do not differ from the literal language of the embodiments or if they include equivalent elements with insubstantial differences from the literal language of the embodiments described herein.
The embodiments of the present disclosure provide a lightweight and fast model predictive control framework that generates control in near real time at a frequency of 10.52 Hz. The method of the present disclosure demonstrates a significant improvement in the total servoing time as compared to conventional visual servoing approaches in 6DoF. Further, the efficiency of the control generation architecture is showcased, which uses a slim neural network architecture for the training process and an adaptive sampling strategy to generate optimal control in real time, without making a heavy compromise on its performance. After each MPC optimization step, the velocity is applied to the agent and new measurements from the sensor are taken, repeating the optimization process for each subsequent step until the IBVS objective is met. The ability of the controller of the present disclosure to train in an online fashion helps it generalize and adapt well to unseen environments. The method of the present disclosure is the fastest deep MPC based approach to visual servoing in six degrees of freedom.
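By way of non-limiting illustration only, the adaptive multivariate Gaussian sampling that underlies the control generation may be sketched in Python as follows. The flow_loss callable is a hypothetical placeholder standing in for the predicted optical flow loss of a candidate velocity, and a plain cross-entropy style refit is shown for simplicity rather than the differentiable formulation described earlier in this specification.

import numpy as np

def adaptive_sampling_step(flow_loss, mean, cov, num_samples=64, num_elite=8):
    """One adaptive sampling update: draw candidate 6-DoF velocities from a
    multivariate Gaussian, score each by its predicted optical-flow loss,
    and refit the Gaussian to the lowest-loss (elite) candidates."""
    rng = np.random.default_rng()
    samples = rng.multivariate_normal(mean, cov, size=num_samples)
    losses = np.array([flow_loss(v) for v in samples])
    elite = samples[np.argsort(losses)[:num_elite]]
    new_mean = elite.mean(axis=0)                                     # updated mean velocity
    new_cov = np.cov(elite, rowvar=False) + 1e-6 * np.eye(len(mean))  # regularize the covariance
    return new_mean, new_cov

Repeating such updates across MPC steps adapts the sampling distribution toward low-loss velocities while retaining the ability to explore.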
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated herein by the following claims.
CLAIMS:
1. A processor implemented method, comprising:
obtaining (202), via a first flow network executed by one or more hardware processors, (i) a goal image corresponding to a scene in an environment;
iteratively performing, until a predicted optical flow loss reaches a pre-defined threshold:
receiving (204), via the first flow network executed by the one or more hardware processors, a current image corresponding to the scene in the environment;
generating (206), via the one or more hardware processors, a target optical flow based on the current image and the goal image;
receiving (208), via a second flow network executed by the one or more hardware processors, (i) a previous image and (ii) the current image corresponding to the scene in the environment and generating a proxy flow depth information thereof, wherein the proxy flow depth information is normalized using a flow depth normalization layer to obtain a normalized flow depth information;
generating (210), via a kinetic model executed by the one or more hardware processors, a set of predicted optical flows, using the normalized flow depth information;
comparing (212) each predicted optical flow from the set of predicted optical flows with the target optical flow to obtain a set of predicted optical flow losses; and
adaptively generating (214), via a control generator executed by the one or more hardware processors, one or more optimal control commands by optimizing (i) the set of predicted optical flow losses and (ii) one or more previously executed control commands being initialized, wherein the one or more optimal control commands are adaptively generated using a neural network trained using one or more control parameters.

2. The processor implemented method of claim 1, wherein the one or more control parameters include an optimized mean value of a subset of the velocities corresponding to an optimized set of predicted optical flow losses used to adaptively update one or more parameters of a multivariate gaussian distribution until the predicted optical flow loss reaches the pre-defined threshold.

3. The processor implemented method of claim 1, wherein the adaptively generated one or more optimal control commands are used to perform visual servoing in real time.

4. The processor implemented method of claim 1, wherein when the predicted optical flow loss reaches the pre-defined threshold, the current image matches the goal image.

5. The processor implemented method of claim 1, wherein the one or more control parameters are obtained based on a differential cross entropy method providing an adaptive multivariate gaussian distribution based sampling.

6. A system (100), comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
obtain, via a first flow network executed by one or more hardware processors, (i) a goal image corresponding to a scene in an environment;
iteratively performing, until a predicted optical flow loss reaches a pre-defined threshold:
receive, via the first flow network executed by the one or more hardware processors, a current image corresponding to the scene in the environment;
generate, a target optical flow based on the current image and the goal image;
receive, via a second flow network executed by the one or more hardware processors, (i) a previous image and (ii) the current image corresponding to the scene in the environment and generating a proxy flow depth information thereof, wherein the proxy flow depth information is normalized using a flow depth normalization layer to obtain a normalized flow depth information;
generate, via a kinetic model executed by the one or more hardware processors, a set of predicted optical flows, using the normalized flow depth information;
compare each predicted optical flow from the set of predicted optical flows with the target optical flow to obtain a set of predicted optical flow losses; and
adaptively generate, via a control generator executed by the one or more hardware processors, one or more optimal control commands by optimizing (i) the set of predicted optical flow losses and (ii) one or more previously executed control commands being initialized.

7. The system of claim 6, wherein the one or more control parameters include an optimized mean value of a subset of the velocities corresponding to an optimized set of predicted optical flow losses used to adaptively update one or more parameters of a multivariate gaussian distribution until the predicted optical flow loss reaches the pre-defined threshold.

8. The system of claim 6, wherein the adaptively generated one or more optimal control commands are used to perform visual servoing in real time.

9. The system of claim 6, wherein when the predicted optical flow loss reaches the pre-defined threshold, the current image matches the goal image.

10. The system of claim 6, wherein the one or more control parameters are obtained based on a differential cross entropy method providing an adaptive multivariate gaussian distribution based sampling.

Documents

Application Documents

# Name Date
1 202121044482-STATEMENT OF UNDERTAKING (FORM 3) [30-09-2021(online)].pdf 2021-09-30
2 202121044482-PROVISIONAL SPECIFICATION [30-09-2021(online)].pdf 2021-09-30
3 202121044482-FORM 1 [30-09-2021(online)].pdf 2021-09-30
4 202121044482-DRAWINGS [30-09-2021(online)].pdf 2021-09-30
5 202121044482-DECLARATION OF INVENTORSHIP (FORM 5) [30-09-2021(online)].pdf 2021-09-30
6 202121044482-Proof of Right [02-03-2022(online)].pdf 2022-03-02
7 202121044482-FORM 3 [22-03-2022(online)].pdf 2022-03-22
8 202121044482-FORM 18 [22-03-2022(online)].pdf 2022-03-22
9 202121044482-ENDORSEMENT BY INVENTORS [22-03-2022(online)].pdf 2022-03-22
10 202121044482-DRAWING [22-03-2022(online)].pdf 2022-03-22
11 202121044482-COMPLETE SPECIFICATION [22-03-2022(online)].pdf 2022-03-22
12 202121044482-FORM-26 [14-04-2022(online)].pdf 2022-04-14
13 Abstract1.jpg 2022-05-17
14 202121044482-FER.pdf 2023-12-28
15 202121044482-FER_SER_REPLY [03-05-2024(online)].pdf 2024-05-03
16 202121044482-COMPLETE SPECIFICATION [03-05-2024(online)].pdf 2024-05-03
17 202121044482-CLAIMS [03-05-2024(online)].pdf 2024-05-03
18 202121044482-US(14)-HearingNotice-(HearingDate-08-09-2025).pdf 2025-08-22
19 202121044482-FORM-26 [04-09-2025(online)].pdf 2025-09-04
20 202121044482-Correspondence to notify the Controller [04-09-2025(online)].pdf 2025-09-04
21 202121044482-Written submissions and relevant documents [22-09-2025(online)].pdf 2025-09-22

Search Strategy

1 ssE_27-12-2023.pdf
2 202121044482_SearchStrategyAmended_E_202121044482AE_04-03-2025.pdf