Abstract: Manipulating objects in real world dynamic scenarios is one of the most challenging problems in Robotics, where performance remains far from human level. Embodiments of the present disclosure provide systems and methods that implement a deterministic policy based actor-critic learning framework to encode a path planning strategy irrespective of a robot pose and target object position. This reinforcement learning (RL) agent solely uses two different views of the environment to learn path planning in order to reach a given target from a random pose, instead of relying on a depth sensor. The actions are torques applied to the RL agent’s joints, and the action utility values are defined on the relative distance between an end effector and the target object. In an episodic learning framework, an actor-critic network learns optimal actions in the continuous space of real numbers for a given state configuration by trying to increase an expected action utility value.
Description:FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of Invention:
STEREO IMAGE VIEWS BASED PATH PLANNING FOR REINFORCEMENT LEARNING AGENTS IN A CONTINUOUS ACTION SPACE
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The disclosure herein generally relates to robotic path planning, and, more particularly, to stereo image views based path planning for reinforcement learning agents in an unstructured 3D environment having a continuous action space.
BACKGROUND
One of the most important tasks for Autonomous Robotics is the ability to manipulate objects in real world unstructured environments. Traditional path planning for robotic manipulators requires the precise location of the target object in the environment, based on which inverse kinematics returns the required joint angles for approaching the object. This limits their use in real domains, where the dynamic relative positions of objects are not readily available. These path planning approaches rely on feature matching and triangulation for object localization, which may be error prone and inaccurate.
Recent work on deep reinforcement learning for manipulation appears to be more successful in adapting to different target object positions. Reinforcement learning (RL) has enjoyed a prominent place in the field of machine learning and artificial intelligence as a powerful framework for teaching intelligent actions to an agent in an unknown or unstructured environment, in order to maximize a notion of cumulative reward. However, traditional reinforcement learning necessitates manual identification of suitable features to represent the state of the system.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one aspect, a processor implemented method for stereo image views based path planning for reinforcement learning agents in an unstructured 3D environment having a continuous action space is provided. The method comprising: receiving, via one or more hardware processors, a first stereo image from a first Red Green Blue (RGB) image capturing device and a second stereo image from a second RGB image capturing device, wherein the first stereo image and the second stereo image are specific to the unstructured 3D environment having the continuous action space, wherein view of the first stereo image is unique from view of the second stereo image, and wherein each of the first stereo image and the second stereo image comprises one or more target objects; generating via the one or more hardware processors, by using the first stereo image and the second stereo image, a path for a trained Reinforcement Learning (RL) agent in the continuous action space to reach a target object from the one or more target objects; tracking (or monitoring) via the one or more hardware processors, in the continuous action space, a path of the trained RL agent taken to reach the target object; generating via the one or more hardware processors, by using a deep deterministic policy gradient technique, an action policy for the tracked path (or monitored path) taken by the trained RL agent to reach the target object; and assigning via the one or more hardware processors, one or more action utility values for each action performed by the trained RL agent to reach the target object based on the generated action policy, wherein the step of assigning the one or more action utility values comprises: computing via the one or more hardware processors, by using the generated action policy, one or more distance parameters based on a distance between a centroid of an end effector of the trained RL agent and the target object, and assigning via the one or more hardware processors, using the one or more distance parameters, one of a positive action utility value or a negative action utility value to each of the actions being performed based on distance information of the target object estimated based on the view associated with the first stereo image and the second stereo image and a comparison of (i) an action performed by the end effector of the trained RL agent to reach the target object and (ii) a pre-defined action pertaining to one or more pre-stored action policies stored in a neural network, wherein the neural network is trained using a deep deterministic policy gradient learning technique, and wherein the one or more pre-stored action policies are learnt by updating an actor-critic network path specific to the unstructured 3D environment during training of the RL agent.
In an embodiment, a positive action utility value is assigned to the trained RL agent for an applied action (or a current action) when the distance between the end effector and the target object is less than or equal to a pre-defined threshold. In another embodiment, a negative action utility value is assigned to the trained RL agent for an applied action (or current action) when the distance between the end effector and the target object is greater than the pre-defined threshold (a current pixel wise distance).
In an embodiment, the method may further comprise training of the RL agent by creating one or more composite images from a first set of stereo images captured by a first image capturing device and a second set of stereo images captured by a second image capturing device, wherein view of each image from the first set of stereo images is unique from view of each image from the second set of stereo images and comprises the unstructured 3D environment. In an embodiment, the step of creating one or more composite images may comprise: resizing an input dimension pertaining to the first set of stereo images and the second set of stereo images to obtain a set of resized images, converting the set of resized images to one or more YUV color space images, extracting Y channel from the one or more YUV color space images to obtain one or more extracted Y channels, and concatenating one or more vertical dimensions of the one or more extracted Y channels to create the one or more composite images.
In another aspect, a system for stereo image views based path planning for reinforcement learning agents in an unstructured 3D environment having a continuous action space is provided. The system comprising: a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: receive a first stereo image from a first Red Green Blue (RGB) image capturing device and a second stereo image from a second RGB image capturing device, wherein the first stereo image and the second stereo image are specific to the unstructured 3D environment having the continuous action space, wherein view of the first stereo image is unique from view of the second stereo image, and wherein each of the first stereo image and the second stereo image comprises one or more target objects, generate, by using the first stereo image and the second stereo image, a path for a trained Reinforcement Learning (RL) agent in the continuous action space to reach a target object from the one or more target objects, track (or monitor) in the continuous action space, a path of the trained RL agent taken to reach the target object, generate, by using a deep deterministic policy gradient technique, an action policy for the tracked path (or monitored path) taken by the trained RL agent to reach the target object, and assign one or more action utility values for each action performed by the trained RL agent to reach the target object based on the generated action policy, wherein each of the actions is assigned one or more action utility values by: computing via the one or more hardware processors, by using the generated action policy, one or more distance parameters based on a distance between a centroid of an end effector of the trained RL agent and the target object; and assigning via the one or more hardware processors, using the one or more distance parameters, one of a positive action utility value or a negative action utility value to each of the actions being performed based on distance information of the target object estimated based on the view associated with the first stereo image and the second stereo image and a comparison of (i) an action performed by the end effector of the trained RL agent to reach the target object and (ii) a pre-defined action pertaining to one or more pre-stored action policies stored in a neural network, wherein the neural network is trained using the deep deterministic policy gradient learning technique, and wherein the one or more pre-stored action policies are learnt by updating an actor-critic network path specific to the unstructured 3D environment during training of the RL agent.
In an embodiment, a positive action utility value is assigned to the trained RL agent for an applied action (or a current action) when the distance between the end effector and the target object is less than or equal to a pre-defined threshold (a current pixel wise distance). In another embodiment, a negative action utility value is assigned to the trained RL agent for an applied action (or current action) when the distance between the end effector and the target object is greater than the pre-defined threshold (a current pixel wise distance).
In an embodiment, the one or more hardware processors may be further configured by the instructions to: create one or more composite images from a first set of stereo images captured by a first image capturing device and a second set of stereo images captured by a second image capturing device, wherein view of each image from the first set of stereo images is unique from view of each image from the second set of stereo images and comprises the unstructured 3D environment, and wherein the one or more composite images are created by: resizing an input dimension pertaining to the first set of stereo images and the second set of stereo images to obtain a set of resized images; converting the set of resized images to one or more YUV color space images; extracting Y channel from the one or more YUV color space images to obtain one or more extracted Y channels; and concatenating one or more vertical dimensions of the one or more extracted Y channels to create the one or more composite images.
In yet another aspect, one or more non-transitory machine readable information storage mediums comprising one or more instructions are provided. The one or more instructions, which when executed by one or more hardware processors, cause a method for stereo image views based path planning for reinforcement learning agents in an unstructured 3D environment having a continuous action space. The method executed by using the instructions comprises: receiving a first stereo image from a first Red Green Blue (RGB) image capturing device and a second stereo image from a second RGB image capturing device, wherein the first stereo image and the second stereo image are specific to the unstructured 3D environment having the continuous action space, wherein view of the first stereo image is unique from view of the second stereo image, and wherein each of the first stereo image and the second stereo image comprises one or more target objects; generating, by using the first stereo image and the second stereo image, a path for a trained Reinforcement Learning (RL) agent in the continuous action space to reach a target object from the one or more target objects; tracking (or monitoring) in the continuous action space, a path of the trained RL agent taken to reach the target object; generating, by using a deep deterministic policy gradient technique, an action policy for the tracked path (or monitored path) taken by the trained RL agent to reach the target object; and assigning one or more action utility values for each action performed by the trained RL agent to reach the target object based on the generated action policy, wherein the step of assigning the one or more action utility values comprises: computing, by using the generated action policy, one or more distance parameters based on a distance between a centroid of an end effector of the trained RL agent and the target object, and assigning via the one or more hardware processors, using the one or more distance parameters, one of a positive action utility value or a negative action utility value to each of the actions being performed based on distance information of the target object estimated based on the view associated with the first stereo image and the second stereo image and a comparison of (i) an action performed by the end effector of the trained RL agent to reach the target object and (ii) a pre-defined action pertaining to one or more pre-stored action policies stored in a neural network, wherein the neural network is trained using the deep deterministic policy gradient learning technique, and wherein the one or more pre-stored action policies are learnt by updating an actor-critic network path specific to the unstructured 3D environment during training of the RL agent.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates an exemplary block diagram of a system for stereo image views based path planning for reinforcement learning agents in a continuous action space according to an embodiment of the present disclosure.
FIG. 2 illustrates an exemplary simulation environment of the system of FIG. 1 for stereo image views based path planning for reinforcement learning agents in a continuous action space in accordance with an example embodiment of the present disclosure.
FIG. 3 illustrates an exemplary flow diagram of a method for stereo image views based path planning for reinforcement learning agents in a continuous action space using the system of FIG. 1 in accordance with an embodiment of the present disclosure.
FIG. 4A-4D depict a suction based manipulator (RL agent) of the proposed system of FIG. 1 in accordance with an example embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
Manipulating objects in real world dynamic scenarios is one of the most challenging problems in Robotics. The difficulty of the task can be gauged from the fact that even in highly structured tasks such as picking objects from warehouse shelves, performance remains far from human level, as demonstrated previously. However, recent developments in deep reinforcement learning offer directions for developing adaptive vision based controllers that can operate in dynamic environments.
With the progress of deep learning techniques in the recent past, this problem has become considerably simpler, with deep neural networks making it possible for the agent to automatically identify high level features encoding the state of a system from low level data such as raw pixel values. This has led to a resurgence of RL techniques to solve complicated tasks, which were intractable before its integration with deep learning. It has widely been surmised that the deep reinforcement learning (DRL) framework is capable of learning many human level tasks, having even surpassed human level performance in tasks like learning to play games.
Embodiments of the present disclosure provide a deterministic policy based actor-critic learning framework to encode the path planning strategy irrespective of the robot pose and target object position. This reinforcement learning (RL) agent solely uses two different views of the environment to learn about path planning in order to reach a given target from a random pose, instead of relying on a depth sensor. The state-space for the RL agent is thus defined as the stereo-view of the environment whereas the action values are torques applied to the robot’s joints. The action utility value function is defined on the relative distance between the end effector and target object in pixels. In episodic learning framework, the actor-critic network learns the optimal actions in the continuous space of real numbers for a given state configuration by trying to increase the expected action utility value.
The present disclosure provides systems and methods that demonstrate the validation of this approach in a simulated environment, yielding a 100% success rate from 100 different robot poses, with relatively few steps required on average to reach the target. The present disclosure further shows that the learning strategy bests deep Q-learning based methods which have been used for similar path planning purposes. This path planning approach does not require conventional feature matching and triangulation for object localization, which is error prone and inaccurate, and solves inverse kinematics and depth estimation using only the scene information.
The present disclosure proposes a vision based deep RL technique for path planning of a robotic manipulator, with the end goal of reaching a target object from any arbitrary initial pose in an unstructured 3D environment. The typical solution for this problem in real-world applications requires object localization in a 3D environment, which is used by an inverse kinematic solver for path planning. Agents trained for other applications using a Deep Q-Network (DQN) demonstrated that RL can be applied to similar problems where the action strategy can be learned purely from raw scene images. This obviates the need for solving the perception and planning problems individually. Instead, the control strategy for the agent can be learnt directly from raw images, where the learning objective is to minimize the temporal difference error. Based on this approach, DQN has also been applied successfully to robotic control tasks such as path planning. However, despite the success of DQN in learning actions from high dimensional observation spaces, its performance while dealing with high dimensional action spaces remained sub-optimal. Also, many physical control tasks have a continuous action space; coarse discretization of such action spaces for the purpose of applying DQN severely limits the dexterity in performance of tasks which require finer control. On the contrary, a finer discretization of the action space leads to an exponential increase in the dimensionality of the action space, which is often not amenable to deep Q-learning. Consequently, training a DQN to perform a task with an inherently continuous action space requires an enormous amount of data in order to ensure sufficient coverage of the continuous action space after discretization, failing which the agent cannot learn viable actions. Thus, the need for an RL agent which can work with a continuous action space instead of a discrete one was imminent in order to mitigate the tradeoff between dexterity and action space dimensionality. This is achieved by the proposed system by leveraging the success of deep Q-learning for continuous control tasks using a deterministic policy gradient based actor-critic algorithm. The present disclosure and its systems and methods adopt an actor-critic algorithm using deterministic policy gradient to learn the action value at each state of the environment. The only information about the state of the system that is provided to the network is in the form of a stereo image pair taken from static RGB cameras. The action space comprises continuous and real valued torques applied to each joint of the robot. The rewards (action utility values) which guide the training are computed from the relative change in distance between the end effector of the manipulator and the target object computed on the image plane. Further, to establish the advantage of using a continuous action space, a similar training strategy with a discrete action space is applied to train a deep Q-network, and its performance is compared with the deterministic policy gradient (DPG) based method.
The primary contributions made by the systems and methods and embodiments of the present disclosure are: 1) a stereo vision based technique for robotic path planning in a 3D environment using a continuous action space, and 2) an implicit depth estimation from a pair of grayscale stereo images using a deep actor-critic network. The trained RL agent is able to develop an implicit perception of the depth of the target object from the stereo images that it is trained upon, and is thus able to consistently guide the manipulator to approach the target object. The experiments performed by the systems and methods of the present disclosure (discussed in later sections) reveal that the DDPG based actor-critic agent trained on a continuous action space outperforms a DQN agent trained on a quantized discrete action space in terms of the number of steps required to reach the target as well as the training episodes required.
Referring now to the drawings, and more particularly to FIG. 1 through 4D, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates an exemplary block diagram of a system 100 for stereo image views based path planning for reinforcement learning agents in a continuous action space according to an embodiment of the present disclosure. In an embodiment, the system 100 includes one or more processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more processors 104. The one or more processors 104 may be one or more software processing modules and/or hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the device 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 108 can be stored in the memory 102, wherein the database 108 may comprise, but is not limited to, information pertaining to resizing an input dimension pertaining to sets of stereo images obtained from different image capturing devices, information on resized images, information on one or more YUV color space images, information on extracted Y channels from the one or more YUV color space images, and information on concatenated vertical dimensions of the one or more extracted Y channels used to create the one or more composite images. In an embodiment, the memory 102 may store the above information, which is utilized by the one or more hardware processors 104 (or by the system 100) to perform the methodology described herein.
FIG. 2, with reference to FIG. 1, illustrates an exemplary simulation environment of the system 100 of FIG. 1 for stereo image views based path planning for reinforcement learning agents in a continuous action space in accordance with an example embodiment of the present disclosure. FIG. 3, with reference to FIGS. 1-2, illustrates an exemplary flow diagram of a method for stereo image views based path planning for reinforcement learning agents in a continuous action space using the system 100 of FIG. 1 and simulation environment of FIG. 2 in accordance with an embodiment of the present disclosure. In an embodiment, the system(s) 100 comprises one or more data storage devices or the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104. The steps of the method of the present disclosure will now be explained with reference to the components of the system 100 as depicted in FIG. 1, and FIG. 2, and the flow diagram of FIG. 3. In an embodiment of the present disclosure, at step 302, the one or more hardware processors 104 receive a first stereo image from a first Red Green Blue (RGB) image capturing device (e.g., an image capturing device 206 of FIG. 2 - say a camera 206 of FIG. 2 positioned at right side of a (trained) RL agent 202 of FIG. 2) and a second stereo image from a second RGB image capturing device 208 of FIG. 2 (e.g., an image capturing device - say a camera 208 of FIG. 2 positioned at left side of the (trained) RL agent 202), wherein the first stereo image and the second stereo image are specific to the unstructured 3D environment having the continuous action space, wherein view of the first stereo image is unique from view of the second stereo image, and wherein each of the first stereo image and the second stereo image comprise one or more target objects (e.g., a target object 210 of FIG. 2). In an embodiment, the expression ‘target object’ may be referred as one of ‘target’, or ‘object’, object for handling, target for handling, target for picking from one location and placing it at another location, and the like. In an embodiment of the present disclosure, at step 304, the one or more hardware processors 104 generate by using the first stereo image and the second stereo image, a path for the trained Reinforcement Learning (RL) agent 202 in the continuous action space to reach the target object 210 from the one or more target objects. In an embodiment of the present disclosure, at step 306, the one or more hardware processors 104 track (or monitor) in the continuous action space, a path of the trained RL agent 202 taken to reach the target object 210. In an embodiment of the present disclosure, at step 308, the one or more hardware processors 104 generate by using one or more deterministic policy gradient technique(s) (e.g., deep deterministic policy gradient technique(s) which includes deep Q learning neural networks), an action policy for the tracked path (or the monitored path) taken by the trained RL agent 202 to reach (and/or handling) the target object 210 to be placed at another location from its current location. 
In an embodiment of the present disclosure, at step 310, the one or more hardware processors 104 assign one or more action utility values for each action performed by the trained RL agent 202 to reach the target object 210 based on the generated action policy, wherein the step of assigning the one or more action utility values comprises: computing (310a) via the one or more hardware processors, by using the generated action policy, one or more distance parameters based on a distance between centroid of an end effector of the trained RL agent 202 and the target object 210, and assigning (310b) via the one or more hardware processors, using the one or more distance parameters, one of a positive action utility value or a negative action utility value to each of the action being performed based on a distance information of the target object 210 estimated based on the view associated with the first stereo image and the second stereo image and a comparison of (i) an action performed by an end effector 204 of the trained RL agent 202 to reach the target object 210 and (ii) a pre-defined action pertaining to the one or more pre-stored action policies stored in a neural network. In an embodiment of the present disclosure the neural network is trained using one or more deep deterministic policy gradient learning technique(s). In another embodiment, the one or more action pre-stored action policies are learnt by updating an actor critic network path specific to the unstructured 3D environment during training of the RL agent 202. In an embodiment, embodiments of the present disclosure and the system 100 may assign action utility value (e.g., a positive action utility value or a negative action utility value) using one or more distance parameters estimated based on a distance between centroid of an end effector of the trained RL agent 202 and the target object 210. In other words, the embodiments of the present disclosure and the system 100 may not be required to rely on the depth information estimated for assigning one or more action utility values (e.g., a positive action utility value or a negative action utility value) for one or more actions being performed by the RL agent 202.
In an embodiment of the present disclosure, a positive action utility value is assigned to the trained RL agent 202 for an applied action (e.g., a step taken, or a current state of the RL agent 202) when the distance (e.g., 2 meters) between the end effector 204 and the target object 210 is less than or equal to a pre-defined threshold (e.g., say 3 meters).
In an embodiment of the present disclosure, a negative action utility value is assigned to the trained RL agent for an applied action (e.g., a step taken, or a current state of the RL agent 202) when the distance (e.g., say 4 meters) between the end effector 204 and the target object 210 is greater than the pre-defined threshold (e.g., 3 meters).
In an embodiment of the present disclosure, the RL agent 202 is trained by creating a repository of one or more composite images from a first set of stereo images captured by the first image capturing device and a second set of stereo images captured by the second image capturing device. In an embodiment of the present disclosure, the repository of the one or more composite images comprising the first set of stereo images and the second set of stereo images may be stored in the memory 102. As discussed above, view of each image from the first set of stereo images is unique from view of each image from the second set of stereo images and comprises the unstructured 3D environment (having the continuous action space). In an embodiment, the step of creating one or more composite images comprises: resizing an input dimension pertaining to the first set of stereo images and the second set of stereo images to obtain a set of resized images, converting the set of resized images to one or more YUV color space images, extracting Y channel from the one or more YUV color space images to obtain one or more extracted Y channels, and concatenating one or more vertical dimensions of the one or more extracted Y channels to create the one or more composite images. In an embodiment, the YUV color space (or model) represents the human perception of color more closely than the standard RGB model used in computer graphics hardware. In YUV, Y is the luminance (brightness) component while U and V are the chrominance (color) components.
Below is an illustrative RL formulation of the system 100:
RL Formulation:
The system 100 and the embodiments of the present disclosure formulate the problem of target reaching from an arbitrary initial pose of the robot as a Markov Decision Process (MDP), whereby the agent learns a policy π which maps a sequence of images into a sequence of actions in the form of efforts applied to the robot’s joints. The system 100 trains a deep network (e.g., using a deep deterministic policy gradient technique) to learn this task using a simulated version of a 7 degree of freedom manipulator. The 3D environment comprising the robot and the target objects is simulated using the Gazebo simulator. A pair of RGB cameras mounted at static positions is also simulated in order to capture two 2D snapshots of the 3D scene. FIG. 2 shows the simulated environment used in the experiments of the present disclosure.
Action Space:
The action space for the path planning task comprises continuous efforts applied at each joint defined as follows:
A ⊆ R^N    (1)
Although the manipulator used in the experiments by the system 100 has 7 degrees of freedom, only 4 degrees of freedom were used for initial experiments, since it was observed that the last 3 joints are chiefly responsible for maneuvering the wrist and thus mostly play a redundant role in the task of reaching a target object. Since 4 joints are actuated, N is 4 in the proposed experiments. An action a ∈ A is thus a 4-dimensional vector and comprises real values.
State Space Representation:
The state space S is defined in terms of the 320 × 240 RGB images captured by the cameras. These images are resized to 84 × 42 and converted to the YUV color space, and the luminance (Y) channel of each image is extracted. These steps are necessary for reducing the input dimensionality. The Y channels from the images of the two cameras are concatenated along their vertical dimension to create a composite image of size 84 × 84. Let this composite image be I(x,y), which constitutes the raw state input to the network. I(x,y) resized to a 7056 dimensional vector is the state s of the system.
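By way of a non-limiting illustrative example, the composite image construction described above may be sketched in Python as follows, assuming OpenCV and NumPy are available; the function name make_state and the library calls are illustrative assumptions rather than a prescribed implementation.

# Illustrative sketch: building the composite state image from the two camera views.
import cv2
import numpy as np

def make_state(img_right, img_left):
    # img_right, img_left: 320x240 BGR images from the two static cameras.
    views = []
    for img in (img_right, img_left):
        resized = cv2.resize(img, (84, 42))             # dsize is (width, height): 84x42 image
        yuv = cv2.cvtColor(resized, cv2.COLOR_BGR2YUV)  # convert to the YUV color space
        views.append(yuv[:, :, 0])                      # keep only the luminance (Y) channel
    composite = np.concatenate(views, axis=0)           # stack vertically: 42 + 42 = 84 rows
    assert composite.shape == (84, 84)                  # composite image I(x, y)
    return composite.astype(np.float32).reshape(-1)     # 7056-dimensional state vector s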
Action utility value assignment:
The system 100 designs an action utility value function (also referred to as ‘action value’) for the proposed task which facilitates incremental learning by designating all intermediate steps which bring the manipulator closer to the target as favorable steps towards the ultimate goal of reaching the target. Intuitively, the distance between the end effector and the target is a good measure in this regard. In the proposed simulation environment, this distance can be readily computed from the 3D coordinates of the end effector and the target object. However, for a real robot interacting with the physical world, access to these 3D coordinates is unavailable. Thereby, in order to make the proposed framework compliant with real world constraints, a measure of this distance computed from the available 2D camera images is chosen.
To compute the distance between the end effector 204 and the target object 210, the components of FIG. 2 may be marked with respective markers, for example, say a red color marker for the end effector 204 and a blue color marker for the target object (color markers not shown in FIG. 2). These color blobs/markers are detected in the images taken by the two cameras, and the distance between the centroids of the two blobs is computed in the image plane. To assign the action value(s) based on these two distances, the fact that in some states the end effector might be occluded in any one of these two views is also taken into account. Let d_1(t) and d_2(t) be the distances of the centroid of the end effector to that of the target object computed from the two camera images at the t^th step. Then the change in distance of the target from the end effector with respect to the previous state for the two cameras is given by expression/equation (2):
Δd_1 = d_1(t-1) - d_1(t);  Δd_2 = d_2(t-1) - d_2(t)    (2)
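By way of a non-limiting illustrative example, the pixel-plane distances d_1(t), d_2(t) and the changes of equation (2) may be computed from the color markers as sketched below in Python; the OpenCV based blob detection and the specific BGR threshold values are assumptions made purely for illustration.

# Illustrative sketch: pixel-plane distance between the end effector marker (red)
# and the target marker (blue) in one camera view; thresholds are assumed values.
import cv2
import numpy as np

def blob_centroid(img_bgr, lower, upper):
    # Return the (x, y) centroid of the color blob, or None if the marker is occluded.
    mask = cv2.inRange(img_bgr, np.array(lower), np.array(upper))
    m = cv2.moments(mask)
    if m["m00"] == 0:                                          # no pixels of this color in the view
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])

def marker_distance(img_bgr):
    # d(t): distance in pixels between the end effector and target centroids, or None.
    ee = blob_centroid(img_bgr, (0, 0, 100), (80, 80, 255))    # red-ish marker (BGR order)
    tgt = blob_centroid(img_bgr, (100, 0, 0), (255, 80, 80))   # blue-ish marker (BGR order)
    if ee is None or tgt is None:
        return None
    return float(np.hypot(ee[0] - tgt[0], ee[1] - tgt[1]))

def delta_d(d_prev, d_curr):
    # Change in distance between consecutive steps (equation (2)); undefined if occluded.
    return None if (d_prev is None or d_curr is None) else d_prev - d_curr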
The measures of the relative change in distance between two consecutive states, Δd_1 or Δd_2, will be undefined if the end effector is in an occluded state at either the current step t or at the previous step t-1 in the corresponding camera view. Δd_1 and Δd_2 are normalized between -1 and +1. A state is defined as a goal state or terminal state when the end effector reaches the vicinity of the target, defined by a threshold ε applied on the average distance (d_1(t) + d_2(t))/2 with respect to the two cameras. The positions of the target relative to the cameras are chosen such that the end effector is unoccluded in both the camera views in the proximity of the target, so that the average distance computed from the images unambiguously reflects the actual distance of the end effector from the target in the 3D environment. When the goal state is attained, a terminal flag T is set indicating the end of a learning episode, i.e., T = 1 when (d_1(t) + d_2(t))/2 ≤ ε. The action utility value function is then defined as follows:
r_t = { 1,                 if T = 1
        -1,                if Δd_1 = 0 and Δd_2 = 0
        Δd_1,              if Δd_2 is undefined
        Δd_2,              if Δd_1 is undefined
        (Δd_1 + Δd_2)/2,   otherwise }    (3)
The agent is given a high positive action value (e.g., a high positive reward, say +1) whenever it takes an action that brings it to the goal state (e.g., into close vicinity of the target object). Sometimes, a joint may reach the limit of its joint angle when consecutively applied efforts force it to move in any one direction. In such a state, the robot joint cannot move any further in that direction even if the effort applied at the next step actuates it to move in that direction. If all the joints are at their joint angle limits, and they do not move any further on application of the next action, then the system would not change its state. This indicates that the applied action at this state is a redundant one, and it is therefore heavily penalized with a negative action value (e.g., a reward of -1). In all other cases, an action value is assigned proportional to the change in distance from the target on application of the action. If the applied action reduces the end effector’s distance from the target, it gets a positive action value and vice versa.
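By way of a non-limiting illustrative example, the action utility value assignment of equation (3) may be sketched in Python as follows; the sketch reuses delta_d from the previous example, the normalization scale and the fallback for occlusion in both views are assumptions, and the goal threshold of 10 pixels follows the experimental value of ε reported later.

# Illustrative sketch of the action utility value assignment of equation (3).
EPSILON = 10.0       # goal threshold on the average pixel distance (experimental value)
NORM_SCALE = 50.0    # assumed scale used to normalize delta-d into [-1, +1]

def normalise(delta):
    if delta is None:
        return None
    return max(-1.0, min(1.0, delta / NORM_SCALE))

def action_utility(d1_prev, d1, d2_prev, d2):
    # Return (action utility value r_t, terminal flag T) from the two camera distances.
    dd1 = normalise(delta_d(d1_prev, d1))
    dd2 = normalise(delta_d(d2_prev, d2))
    if d1 is not None and d2 is not None and (d1 + d2) / 2.0 <= EPSILON:
        return 1.0, 1                      # goal state reached: T = 1
    if dd1 == 0.0 and dd2 == 0.0:
        return -1.0, 0                     # redundant action: the state did not change
    if dd2 is None and dd1 is not None:
        return dd1, 0                      # end effector occluded in the second view
    if dd1 is None and dd2 is not None:
        return dd2, 0                      # end effector occluded in the first view
    if dd1 is None and dd2 is None:
        return 0.0, 0                      # occluded in both views (assumed fallback)
    return (dd1 + dd2) / 2.0, 0            # average change in distance otherwise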
Methodology of the present disclosure proposed by the system 100:
The raw state information available to the agent is a vector of pixel values, denoted by s. At each training step, the experience tuple e_t = (s_t, a_t, r_t, s_(t+1)) is stored in the replay memory D, where s_t is the present state, a_t is the current action, r_t is the observed action value, and s_(t+1) is the next state. For the initial P steps of the training, where P < |D|, random actions are chosen at each step, and the replay memory is populated with the resulting experience, without updating the network parameters. After the first P steps, learning commences, and the network is updated on mini batches of size M sampled at random from the replay memory.
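By way of a non-limiting illustrative example, a replay memory consistent with the above description may be sketched in Python as follows; the class and method names are illustrative, while the capacity and mini batch size follow |D| and M of Table 1.

# Illustrative sketch of the replay memory D holding experience tuples e_t.
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["s", "a", "r", "s_next", "terminal"])

class ReplayMemory:
    def __init__(self, capacity=10000):           # |D| = 10,000 as in Table 1
        self.buffer = deque(maxlen=capacity)      # oldest experiences are discarded first

    def add(self, s, a, r, s_next, terminal):
        self.buffer.append(Experience(s, a, r, s_next, terminal))

    def sample(self, batch_size=100):             # mini batch size M = 100
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Learning only commences once the first P random-action steps have filled the memory:
# if len(memory) >= P: batch = memory.sample(M); ... update the actor and critic ...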
An actor-critic network trained using the DPG approach is used by the present disclosure to learn the action policy from the pixel observations representing the state of the system. The system 100 employs a parameterized actor function µ(s|θ^µ) which deterministically maps states to actions using the current policy µ. The current actor policy is updated by the critic Q(s,a), which itself is learned using the Bellman equation as in Q-learning.
Proposed Algorithm/Proposed technique:
RL based methods seek to maximize the expected future action utility values. The discounted future action utility value from a state is defined as:
R_t = Σ_(i=t)^T γ^(i-t) r_i    (4)
Here, γ ∈ [0,1] is the discount factor for future action utility values. In this RL formulation, the policy π is learned such that the expected return from the start distribution is maximized, where the expected return from the start distribution is:
J = E_(s_i, a_i, r_i)[R_1]
Assuming a deterministic target policy µ: S → A, the Bellman equation is used for the proposed RL paradigm to get the optimal action utility values of the actor as follows:
Q^µ(s, a) = E_(s')[r + γ max_(a') Q^µ(s', a') | s, a]    (5)
where γ is the discount factor for future action utility values. Thus, the optimal action utility value function is arrived at by choosing that action which maximizes the expected future action utility values. In this context, the CNN with weights θ^Q acts as a nonlinear approximator for this action value function. The target value at the t^th step is approximated as follows:
z = r_t + γ max_(a') Q(s', a'; θ^Q)    (6)
Using these approximate target values, the loss function at the t^th training step is given by:
L_t(θ^Q) = E_(s,a,r)[(E_(s')[z | s, a] - Q(s, a; θ_t^Q))^2]    (7)
Q-learning is one such RL technique which minimizes the above loss function to learn the optimal action utility value function, and thereafter a greedy approach (or greedy algorithm/technique) is employed to select the optimal action. A greedy algorithm/technique is an algorithmic paradigm that follows the problem solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum. However, for a continuous action space, finding such a greedy policy is impractical. The actor-critic reinforcement learning model based on deterministic policy gradient proved to be useful in this scenario. In this learning framework, an actor function µ(s|θ^µ) deterministically maps states to continuous actions. The critic Q(s,a|θ^Q) is updated using the Bellman equation similar to Q-learning. The actor is updated by applying the chain rule on the expected return from the start distribution with respect to the network parameters as follows:
∇_(θ^µ) J ≈ E_(s_t)[∇_(θ^µ) Q(s, a | θ^Q)|_(s=s_t, a=µ(s_t|θ^µ))] = E_(s_t)[∇_a Q(s, a | θ^Q)|_(s=s_t, a=µ(s_t)) ∇_(θ^µ) µ(s | θ^µ)|_(s=s_t)]    (8)
Unlike supervised learning, the targets are not fixed, and depend upon the network parameters. The proposed technique of the present disclosure employs the soft target update strategy where a copy of the actor and critic networks, µ'(s|θ^(µ')) and Q'(s,a|θ^(Q')), is utilized for computing the target values. The weights of the target networks are then updated as follows:
θ' ← τθ + (1 - τ)θ', where τ ≪ 1    (9)
The above update strategy lets the weights of the target networks change slowly according to the learned weights, thereby ensuring the stability of the learning process. In order to perform exploration efficiently over a continuous action space, the present disclosure resorts to the Ornstein-Uhlenbeck process, whereby noise sampled from a temporally correlated noise process N is added with inertia to the actions generated by the actor as follows:
µ'(s_t) = µ(s_t | θ_t^µ) + N_t    (10)
The network parameters are updated at each training step after the first P steps, which are utilized for building the replay memory. The target network parameters are then softly updated using equation (9) in order to make the target networks track the parameters of the learned networks.
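By way of a non-limiting illustrative example, the soft target update of equation (9) and the Ornstein-Uhlenbeck exploration noise of equation (10) may be sketched in Python as follows; the parameter values follow Table 1, and the NumPy based formulation is an illustrative assumption.

# Illustrative sketch of the soft target update (equation (9)) and the
# Ornstein-Uhlenbeck exploration noise (equation (10)).
import numpy as np

def soft_update(target_params, learned_params, tau=1e-2):
    # theta' <- tau * theta + (1 - tau) * theta' for every parameter tensor.
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, learned_params)]

class OUNoise:
    # Temporally correlated noise N added to the actor's action (equation (10)).
    def __init__(self, dim=4, mu=0.0, theta=0.15, sigma=0.5):  # values from Table 1
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = np.full(dim, mu)

    def reset(self):                       # called at the start of each episode
        self.x[:] = self.mu

    def sample(self):
        dx = self.theta * (self.mu - self.x) + self.sigma * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x

# Exploratory action at each step: a_t = mu(s_t) + noise.sample()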
Network Architecture:
The input to the network is the composite image of size 84×84 obtained after preprocessing. An actor-critic network architecture is employed, where the size of the final layer is equal to the number of degrees of freedom of the robotic manipulator used for the task. The networks in the proposed system 100 comprise 3 convolutional layers having 32 filters each followed by 2 fully connected layers, each of 200 units. All hidden layers used a rectifier non-linearity activation. Batch normalization is employed at the input and after each hidden layer.
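By way of a non-limiting illustrative example, an actor-critic network pair matching the above description may be sketched in PyTorch as follows (the original implementation used Torch7); the kernel sizes, strides, output squashing, and the layer at which the action is concatenated into the critic are assumptions, since the description above fixes only the number of layers, filters and units.

# Illustrative PyTorch sketch of the actor and critic networks described above.
import torch
import torch.nn as nn

def conv_stack():
    # 3 convolutional layers of 32 filters each, with batch normalization and ReLU.
    return nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.BatchNorm2d(32), nn.ReLU(),
        nn.Conv2d(32, 32, kernel_size=4, stride=2), nn.BatchNorm2d(32), nn.ReLU(),
        nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.BatchNorm2d(32), nn.ReLU(),
        nn.Flatten(),
    )

class Actor(nn.Module):
    def __init__(self, n_joints=4):
        super().__init__()
        self.features = conv_stack()
        n_feat = self.features(torch.zeros(1, 1, 84, 84)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(n_feat, 200), nn.ReLU(),
            nn.Linear(200, 200), nn.ReLU(),
            nn.Linear(200, n_joints), nn.Tanh(),      # continuous joint efforts (assumed squashing)
        )

    def forward(self, state):                         # state: (batch, 1, 84, 84) composite image
        return self.head(self.features(state))

class Critic(nn.Module):
    def __init__(self, n_joints=4):
        super().__init__()
        self.features = conv_stack()
        n_feat = self.features(torch.zeros(1, 1, 84, 84)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(n_feat + n_joints, 200), nn.ReLU(),   # action concatenated to image features
            nn.Linear(200, 200), nn.ReLU(),
            nn.Linear(200, 1),                              # scalar action utility value Q(s, a)
        )

    def forward(self, state, action):
        return self.head(torch.cat([self.features(state), action], dim=1))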
For the sake of comparison, a deep Q-network was implemented for the same path planning task using a similar network architecture, but the fully connected output layer was modified to account for the much larger discrete action space required in the experiments of the present disclosure and system 100. It must be pointed out that the size of the fully connected output layer is typically much smaller for the actor-critic network than for the Q-network, since the Q-network must enumerate the discretized action space.
Below is an illustrative algorithm for training the actor-critic networks (using the deep deterministic policy gradient technique) implemented by the system 100 of FIG. 1 by way of example:
Input: Composite image I_t and goal state information (terminal flag) T
Initialization: Replay memory D = {e_1, e_2, …, e_P}; actor network µ(s|θ^µ) and critic network Q(s,a|θ^Q) with random weights θ^µ and θ^Q respectively; target networks µ' and Q' with weights θ^(µ') and θ^(Q') respectively.
for t = 1 to 10,000 do
Initialize random process N for exploration.
while (T ≠ 1) do
Preprocess I_t and vectorize to represent state s_t.
Select action a_t = µ(s_t | θ^µ) + N
Execute a_t in the simulator, and obtain r_t, I_(t+1) and T
Preprocess I_(t+1) and vectorize to represent state s_(t+1)
Add the transition (s_t, a_t, r_t, s_(t+1)) to D.
if t ≥ P then
Sample a random mini batch of size M consisting of transitions (s_i, a_i, r_i, s_(i+1)) from D.
Set z_i = r_i + γ Q'(s_(i+1), µ'(s_(i+1) | θ^(µ')) | θ^(Q'))
Update the critic by minimizing the loss L = (1/M) Σ_i (z_i - Q(s_i, a_i | θ^Q))^2 with respect to the network parameters θ^Q.
Update the actor network using the sampled policy gradient as follows:
∇_(θ^µ) J ≈ (1/M) Σ_i ∇_a Q(s, a | θ^Q)|_(s=s_i, a=µ(s_i)) ∇_(θ^µ) µ(s | θ^µ)|_(s=s_i)
Update the target actor and critic networks as follows:
θ^(Q') ← τ θ^Q + (1 - τ) θ^(Q')
θ^(µ') ← τ θ^µ + (1 - τ) θ^(µ')
end if
end while
end for
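By way of a non-limiting illustrative example, a single update step of the above training loop may be sketched in Python as follows, reusing the actor, critic and replay memory sketches given earlier; the use of the Adam optimizer, the tensor handling, and the omission of terminal-state masking (as in the pseudocode above) are assumptions, while the learning rates, γ and τ follow Table 1.

# Illustrative sketch of one DDPG update step from the training loop above.
import copy
import torch
import torch.nn.functional as F

actor, critic = Actor(), Critic()
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)     # actor learning rate (Table 1)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)   # critic learning rate (Table 1)
GAMMA, TAU = 0.99, 1e-2

def ddpg_update(batch):
    # batch: list of Experience tuples sampled from the replay memory.
    s = torch.stack([torch.as_tensor(e.s).view(1, 84, 84) for e in batch])
    a = torch.stack([torch.as_tensor(e.a, dtype=torch.float32) for e in batch])
    r = torch.tensor([e.r for e in batch], dtype=torch.float32).unsqueeze(1)
    s_next = torch.stack([torch.as_tensor(e.s_next).view(1, 84, 84) for e in batch])

    # Critic update: z_i = r_i + gamma * Q'(s_(i+1), mu'(s_(i+1)))
    with torch.no_grad():
        z = r + GAMMA * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), z)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: ascend the sampled deterministic policy gradient
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates (equation (9))
    for tgt, src in ((target_actor, actor), (target_critic, critic)):
        for tp, p in zip(tgt.parameters(), src.parameters()):
            tp.data.mul_(1.0 - TAU).add_(TAU * p.data)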
Experimental Results:
A 7 degree of freedom Barrett Whole Arm Manipulator (WAM) was simulated in Gazebo, using ROS Indigo as the interface between the proposed algorithm and the simulated robot. Torch7 with CUDA 7.5 was chosen as the deep learning framework to construct and train the networks. All the experiments were performed on an Ubuntu 14.04 system with 16 GB RAM, an Intel Core i7-4810MQ processor and an NVIDIA Quadro K2100M GPU.
Four degrees of freedom of the manipulator were used for the reaching task, so the size of the action space N was 4 in the conducted experiments. The value of the threshold ε was set to 10. The performance of the proposed path planning algorithm is critically dependent on the choice of several hyperparameters. The values used for these hyperparameters are summarized along with their definitions for the DDPG and DQN trainings in Tables 1 and 2. More particularly, Table 1 depicts the hyperparameters used for DDPG training by way of example, and Table 2 depicts the hyperparameters used for DQN training.
Table 1
Parameter    Definition                              Value
P            No. of initial steps                    1000
M            Mini batch size                         100
ε            Exploration probability                 Annealed linearly from 1 to 0.1 over 10,000 steps
C            Target networks' update frequency       1
|D|          Size of replay memory                   10,000
α            Learning rate                           Actor: 10^(-4); Critic: 10^(-3)
γ            Discount factor                         0.99
µ            Ornstein-Uhlenbeck process mean         0
σ            Ornstein-Uhlenbeck process variance     0.5
θ            Ornstein-Uhlenbeck process inertia      0.15
τ            Soft target update factor               10^(-2)
Table 2
Parameter    Definition                              Value
P            No. of initial steps                    1000
M            Mini batch size                         100
ε            Exploration probability                 Annealed linearly from 1 to 0.1 over 10,000 steps
C            Target network update frequency         100
|D|          Size of replay memory                   10,000
α            Learning rate                           0.0025
γ            Discount factor                         0.2
For the actor network, an L2 weight decay of 10^(-2) was used. The weights of the final layers of both the actor and critic networks are initialized from a uniform distribution [-3×10^(-4), 3×10^(-4)], whereas the weights of all other layers are initialized from a uniform distribution [-1/√f, 1/√f], where f is the fan-in of the layer. For the deep Q-network, all layers were initialized using the Xavier initialization.
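By way of a non-limiting illustrative example, the layer initialization described above may be sketched in Python for the PyTorch actor and critic sketches given earlier; the mapping onto PyTorch modules and the indexing of the final layers are assumptions.

# Illustrative sketch of the weight initialization described above.
import math
import torch.nn as nn

def init_hidden(layer):
    # Uniform init in [-1/sqrt(f), 1/sqrt(f)], where f is the fan-in of the layer.
    if isinstance(layer, (nn.Linear, nn.Conv2d)):
        fan_in = layer.weight[0].numel()        # in-features (Linear) or in_ch*kh*kw (Conv2d)
        bound = 1.0 / math.sqrt(fan_in)
        nn.init.uniform_(layer.weight, -bound, bound)
        nn.init.uniform_(layer.bias, -bound, bound)

def init_final(layer, bound=3e-4):
    # Uniform init in [-3e-4, 3e-4] for the output layers of the actor and critic.
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)

actor.apply(init_hidden)
critic.apply(init_hidden)
init_final(actor.head[-2])    # last Linear layer before the Tanh squashing (assumed indexing)
init_final(critic.head[-1])   # final Linear layer producing Q(s, a)
# For the deep Q-network comparison, nn.init.xavier_uniform_ could be used instead.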
In the following sections, various aspects of the performance of the proposed deep reinforcement learning based path planning framework were analysed.
Training Statistics:
Although the combined task of reaching the target, grasping and finally lifting it, as attempted in existing approaches, is much more complicated than the task of reaching a target which the embodiments of the present disclosure presently attempt, reaching may often be sufficient if a suction based actuator is used for lifting an object, which is often the case (see FIGS. 4A-4D). FIGS. 4A-4D, with reference to FIGS. 1 through 3, depict a suction based manipulator (RL agent) of the proposed system 100 of FIG. 1 in accordance with an example embodiment of the present disclosure. Specifically, FIG. 4A depicts the trained RL agent in which the arm is required to reach the target object and pick it up using suction in accordance with an example embodiment of the present disclosure. FIG. 4B depicts the trained RL agent in which transport of the target object to the correct location is performed in accordance with an example embodiment of the present disclosure. FIG. 4C depicts the trained RL agent in which it is choosing the correct box in accordance with an example embodiment of the present disclosure. FIG. 4D depicts the trained RL agent where it stows the target object in accordance with an example embodiment of the present disclosure. It is to be understood by a person having ordinary skill in the art that while this is for a retail application, it is easy to see how this may be useful for autonomous robots that are required to place small items like medicine bottles, etc. in their proper location.
Interestingly, the RL agent in the proposed approach learns to reach much faster; whereas the existing approaches reported that the agent learned to complete the entire task successfully in 1800 episodes (in their case each episode ends after 1000 iterations or completion of a task), when starting from the same initial pose, the RL agent of the present disclosure learnt the reaching task within just 1500 iterations. This is just 500 training steps after building the replay memory. This tremendous improvement in learning speed can be attributed to the much smaller size of the output layer in the case of the policy gradient based actor-critic learning framework, which results in fewer network parameters to be learnt. Even for yet another conventional method, the agent used for path planning from random initial poses was trained for 5.225 million training steps, which is far greater than the number of training steps required for the agent to learn in the proposed approach of the present disclosure. The embodiments and the systems and methods of the present disclosure hereby surmise that the larger action space used in the experiments, combined with the action utility value assignment strategy used for incremental learning, is responsible for this improvement.
Testing performance:
Although a performance metric based on the average action utility value achieved per episode is commonly adopted to benchmark performance, the performance of the proposed algorithm of the present disclosure was quantified in terms of the average number of steps required to reach the target object, since unlike the action value functions for the tasks in existing methods, the proposed action utility value function is characterized by sharp discontinuities on reaching the goal, as well as on executing redundant actions. In order to evaluate the performance of the proposed path planning algorithm, a test set consisting of 100 random initial poses of the manipulator/RL agent was created. These 100 poses are obtained by setting the joint angles of the first 4 joints to random values which lie within the respective joint angle ranges, and keeping the angles of the last 3 joints (those that are responsible for the movement of the wrist) fixed at 0 radians. The performance of the trained RL agent 202 was further evaluated for different values of γ, and for different levels of training progress, in terms of the percentage of successful attempts to reach the goal state and the average number of steps required to reach the goal starting from each arbitrary initial pose in the test set. With these experimental settings, the performance of the DDPG algorithm is summarized in Table 3. More particularly, Table 3 depicts the DDPG testing performance illustrated below by way of example:
Table 3
Training steps                                        1500     2000     5000     7000     8500
Percentage (%) of successful attempts (γ = 0.99)       100      100      100      100      100
Average no. of steps (γ = 0.99)                       1.95     2.92     2.42     2.21     1.98
Table 4 depicts DQN testing Performance illustrated below by way of example:
Table 4
Training steps                                        1500     3500     5500     7500     10000
Percentage (%) of successful attempts (γ = 0.2)         31      100       81      100      100
Average no. of steps (γ = 0.2)                       21.31     2.52     8.86     3.15     2.96
Percentage (%) of successful attempts (γ = 0.5)         62       64       89       97       96
Average no. of steps (γ = 0.5)                       16.19    12.35     6.11     4.70     6.12
For the DQN algorithm, however, training with γ = 0.99 yields worse performance than for lower values of γ. This points towards a lower contribution of the previously achieved action utility values to the current action policy, which is probably a consequence of the larger output layer size of the DQN. The present disclosure reports the performance of the DQN algorithm for two different values of the discount factor γ in Table 4. Overall, γ = 0.2 gave the best performance both in terms of success rate and the number of steps required to reach the target on average. Increasing γ beyond 0.5 worsens the performance, with the RL agent failing to reach the target in most cases for γ = 0.9. It can be seen that the technique of DDPG applied for target reaching outperforms DQN by a considerable margin both in terms of training steps required and success rate.
Conclusion:
A vital capability for autonomous robots is the ability to manipulate objects in unstructured environments without prior knowledge about joint angles and object positions. In the present disclosure, systems and methods provide an approach for robot arm control in the continuous action space using an actor-critic based RL framework trained using deterministic policy gradient. The proposed approach relies on stereo vision and does not require an intermediate perception module or additional information about the robot position and joint angles. This is demonstrated through the path planning of a robotic manipulator (also referred to as the RL agent 202). For comparison, a deep Q-network has also been employed by the embodiments of the present disclosure to perform the path planning using the same experimental setup. The experimental results demonstrated that the DDPG based learning technique outperforms the DQN based learning by a substantial margin both in terms of the number of learning steps required as well as the success rate.
In order to make the RL agent learn an implicit perception of depth exploiting the visual cues available in a 3D environment, the embodiments of the present disclosure propose a stereo vision based learning scheme, in which the agent is trained on a pair of RGB camera images, using simulated camera models to capture snapshots of the 3D environment. It is expected that the proposed approach will be able to better generalize when applied to real world scenarios. The fact that the 3D simulator (Gazebo) used is well equipped to simulate the physics of the real world, such as gravity and collision properties, is also expected to improve generalization. Another notable difference of the proposed approach with respect to the conventional technique(s), in terms of the action utility value function used, is that the distances used in the computation of their inverse exponential reward function are obtained directly using the x, y and z coordinates obtained from the 3D simulator. In contrast, the embodiments of the present disclosure refrain from using the 3D coordinates obtained from the simulator, and instead use 2D camera images to estimate the actual three dimensional distance between the end effector and the target. Pre-trained networks trained further on real robots have the potential to vastly improve the performance of target reaching, but 3D coordinates would not be available while training in the real world. This makes the proposed scheme of action utility value assignment much more amenable to being transferred to learning in real world scenarios than the conventional technique(s). The use of a continuous action space, unlike the discrete ones employed in traditional systems and methods, substantially aids the learning process as discussed in the foregoing sections.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, BLU-RAYs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.
Claims:
WE CLAIM:
1. A processor implemented method for stereo image views based path planning for reinforcement learning agents in an unstructured 3D environment having a continuous action space, comprising:
receiving (302), via one or more hardware processors a first stereo image from a first Red Green Blue (RGB) image capturing device and a second stereo image from a second RGB image capturing device, wherein the first stereo image and the second stereo image are specific to the unstructured 3D environment having the continuous action space, wherein view of the first stereo image is unique from view of the second stereo image, and wherein each of the first stereo image and the second stereo image comprise one or more target objects;
generating (304) via the one or more hardware processors, by using the first stereo image and the second stereo image, a path for a trained Reinforcement Learning (RL) agent in the continuous action space to reach a target object from the one or more target objects;
tracking (306) via the one or more hardware processors, in the continuous action space, a path of the trained RL agent taken to reach the target object;
generating (308) via the one or more hardware processors, by using a deep deterministic policy gradient technique, an action policy for the tracked path taken by the trained RL agent to reach the target object; and
assigning (310) via the one or more hardware processors, one or more action utility values for each action performed by the trained RL agent to reach the target object based on the generated action policy, wherein the step of assigning the one or more action utility values comprises:
computing (310a) via the one or more hardware processors, by using the generated action policy, one or more distance parameters based on a distance between centroid of an end effector of the trained RL agent and the target object; and
assigning (310b) via the one or more hardware processors, using the one or more distance parameters, one of a positive action utility value or a negative action utility value to each of the actions being performed based on a distance information of the target object estimated based on the view associated with the first stereo image and the second stereo image and a comparison of (i) an action performed by the end effector of the trained RL agent to reach the target object and (ii) a pre-defined action pertaining to the one or more pre-stored action policies stored in a neural network, wherein the neural network is trained using the deep deterministic policy gradient technique, and wherein the one or more pre-stored action policies are learnt by updating an actor critic network path specific to the unstructured 3D environment during training of the RL agent.
2. The processor implemented method of claim 1, wherein a positive action utility value is assigned to the trained RL agent for an applied action when distance between the end effector and the target object is less than or equal to a pre-defined threshold.
3. The processor implemented method of claim 1, wherein a negative action utility value is assigned to the trained RL agent for an applied action when distance between the end effector and the target object is greater than a pre-defined threshold.
4. The processor implemented method of claim 1, further comprising creating one or more composite images from a first set of stereo images captured by a first image capturing device and a second set of stereo images captured by a second image capturing device,
wherein view of each image from the first set of stereo images is unique from view of each image from the second set of stereo images and comprises the unstructured 3D environment, and
wherein the step of creating one or more composite images comprises:
resizing an input dimension pertaining to the first set of stereo images and the second set of stereo images to obtain a set of resized images;
converting the set of resized images to one or more YUV color space images;
extracting Y channel from the one or more YUV color space images to obtain one or more extracted Y channels; and
concatenating one or more vertical dimensions of the one or more extracted Y channels to create the one or more composite images.
5. A system (100) for stereo image views based path planning for reinforcement learning agents in an unstructured 3D environment having a continuous action space, the system (100) comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive a first stereo image from a first Red Green Blue (RGB) image capturing device and a second stereo image from a second RGB image capturing device, wherein the first stereo image and the second stereo image are specific to the unstructured 3D environment having the continuous action space, wherein view of the first stereo image is unique from view of the second stereo image, and wherein each of the first stereo image and the second stereo image comprise one or more target objects;
generate, by using the first stereo image and the second stereo image, a path for a trained Reinforcement Learning (RL) agent in the continuous action space to reach a target object from the one or more target objects;
track, in the continuous action space, a path of the trained RL agent taken to reach the target object;
generate, by using a deep deterministic policy gradient technique, an action policy for the tracked path taken by the trained RL agent to reach the target object; and
assign one or more action utility values for each action performed by the trained RL agent to reach the target object based on the generated action policy, wherein each of the actions is assigned one or more action utility values by:
computing via the one or more hardware processors, by using the generated action policy, one or more distance parameters based on a distance between centroid of an end effector of the trained RL agent and the target object; and
assigning via the one or more hardware processors, using the one or more distance parameters, one of a positive action utility value or a negative action utility value to each of the actions being performed based on a distance information of the target object estimated based on the view associated with the first stereo image and the second stereo image and a comparison of (i) an action performed by the end effector of the trained RL agent to reach the target object and (ii) a pre-defined action pertaining to the one or more pre-stored action policies stored in a neural network,
wherein the neural network is trained using the deep deterministic policy gradient technique, and wherein the one or more pre-stored action policies are learnt by updating an actor critic network path specific to the unstructured 3D environment during training of the RL agent.
6. The system of claim 5, wherein a positive action utility value is assigned to the trained RL agent for an applied action when distance between the end effector and the target object is less than or equal to a pre-defined threshold.
7. The system of claim 5, wherein a negative action utility value is assigned to the trained RL agent for an applied action when distance between the end effector and the target object is greater than a pre-defined threshold.
8. The system of claim 5, wherein the one or more hardware processors (104) are further configured by the instructions to: create one or more composite images from a first set of stereo images captured by a first image capturing device and a second set of stereo images captured by a second image capturing device, wherein view of each image from the first set of stereo images is unique from view of each image from the second set of stereo images and comprises the unstructured 3D environment, and
wherein the one or more composite images are created by:
resizing an input dimension pertaining to the first set of stereo images and the second set of stereo images to obtain a set of resized images;
converting the set of resized images to one or more YUV color space images;
extracting Y channel from the one or more YUV color space images to obtain one or more extracted Y channels; and
concatenating one or more vertical dimensions of the one or more extracted Y channels to create the one or more composite images.