Abstract: Manipulating a target object using non-prehensile actions presents a complex problem that requires addressing multiple constraints, such as finding collision-free trajectories, modeling the object dynamics, and speed of execution. Previous approaches have relied on contact-rich manipulation, where the object moves predictably while attached to the manipulator, thereby avoiding the need to model the object's dynamics. However, this assumption may not always hold, and different end-effectors may be necessary depending on the use case. A modular framework is disclosed that disentangles the problem into two smaller components: a high-level planner that determines a collision-free trajectory from the current position to a goal position, and a low-level RL planner. The low-level RL planner operates purely on object-centric representations and has an object-centric action space, making it context agnostic and thereby reducing the training effort.
DESC:FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
MODULAR FRAMEWORK FOR OBJECT REARRANGEMENT IN COMPLEX SCENES USING NON-PREHENSILE ACTIONS
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS REFERENCE TO RELATED APPLICATIONS AND PRIORITY
The present application claims priority from provisional patent application number 202321057257 filed in India on 25 August 2023, the entire contents of the aforementioned application are incorporated herein by reference.
TECHNICAL FIELD
The embodiments herein generally relate to the field of task management using robotics and, more particularly, to a method and system for modular framework for object rearrangement by a robotic agent in complex scenes using non-prehensile actions.
BACKGROUND
Non-prehensile action is an important aspect of object manipulation, for instance, to manipulate heavy objects or objects too wide to grasp. Conventional approaches in robotics addressing non-prehensile manipulation mostly use Deep Reinforcement Learning (DRL), trained in an end-to-end fashion, to perform non-prehensile pushing. The end-to-end approach requires the DRL networks to be trained to be context aware, which requires longer training times. Thus, the approach is time and computation inefficient.
Non-prehensile manipulation in an end-effector agnostic manner, which assumes the object dynamics to be independent of the end-effector dynamics, is technically challenging. The primary challenge in developing a policy to execute such an action is the stochastic outcome of pushing an object: in the absence of privileged information such as friction and the weight of the target object, it is impossible to accurately predict the dynamics of the object under a push action. To tackle this, several works in the art have tried to explicitly model the outcome by training a deep neural network on a large number of push-outcome pairs collected via simulation. However, such explicit modeling results in poor generalization, making multi-step, long-horizon planning of pushing an object from start to goal challenging.
In recent years, Deep reinforcement learning (DRL) has shown promising results in various robotics applications. RL algorithms learn to perform a task by maximizing a reward signal, which can be a scalar value that reflects the task’s success or failure; or a dense reward indicating the distance from the goal.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.
For example, in one embodiment, a method for modular framework for object rearrangement by a robotic agent in complex scenes using non-prehensile actions is provided.
The method includes receiving, by a high-level planner executed by one or more hardware processors of a robotic agent, an environment information comprising (i) a current position of an object among a plurality of objects positioned within a scene of interest, (ii) a goal position of each of the plurality of objects, (iii) shape, location, and size of each of the plurality of objects and a plurality of obstacles in the scene of interest during movement of the object.
Further, the method includes computing, by the high-level planner executed by the one or more hardware processors, an optimal collision free trajectory comprising a plurality of waypoints for moving the object from the current position to the goal position traversing through the plurality of obstacles, where the current position corresponds to an initial way point, and the goal position corresponds to a final way point among the plurality of waypoints.
Further, the method includes estimating, by a low-level Reinforcement Learning (RL) planner executed by one or more hardware processors of the robotic agent, utilizing an object centric action space, object dynamics in terms of a push and a push-direction required to move the object from the initial waypoint to a successive waypoint among the plurality of way points, wherein the robotic agent maneuvers a robotic arm in accordance with the estimated push and push direction providing non-prehensile actions to move the object to the successive way point executing an immediate short goal, and wherein the low-level RL planner is trained using a context agnostic training with collision problem solved by the optimal collision free trajectory computed by the high-level planner.
Furthermore the method includes iteratively estimating by the low-level RL planner the push and the push direction successively towards each of the plurality of way points until the object reaches the final way point at the goal position.
In another aspect, a system for modular framework for object rearrangement by a robotic agent in complex scenes using non-prehensile actions is provided. The system comprises a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to receive, by a high-level planner executed by one or more hardware processors of a robotic agent, an environment information comprising (i) a current position of an object among a plurality of objects positioned within a scene of interest, (ii) a goal position of each of the plurality of objects, (iii) shape, location, and size of each of the plurality of objects and a plurality of obstacles in the scene of interest during movement of the object.
Further, the system computes via the high-level planner executed by the one or more hardware processors, an optimal collision free trajectory comprising a plurality of waypoints for moving the object from the current position to the goal position traversing through the plurality of obstacles, where the current position corresponds to an initial way point, and the goal position corresponds to a final way point among the plurality of waypoints.
Further, the system estimates, via a low-level Reinforcement Learning (RL) planner executed by the one or more hardware processors of the robotic agent, utilizing an object centric action space, object dynamics in terms of a push and a push-direction required to move the object from the initial waypoint to a successive waypoint among the plurality of way points, wherein the robotic agent maneuvers a robotic arm in accordance with the estimated push and push direction providing non-prehensile actions to move the object to the successive way point executing an immediate short goal, and wherein the low-level RL planner is trained using a context agnostic training with the collision problem solved by the optimal collision free trajectory computed by the high-level planner.
Furthermore, the system iteratively estimates via the low-level RL planner the push and the push direction successively towards each of the plurality of way points until the object reaches the final way point at the goal position.
In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions, which when executed by one or more hardware processors cause a method for modular framework for object rearrangement by a robotic agent in complex scenes using non-prehensile actions to be performed.
The method includes receiving, by a high-level planner executed by one or more hardware processors of a robotic agent, an environment information comprising (i) a current position of an object among a plurality of objects positioned within a scene of interest, (ii) a goal position of each of the plurality of objects, (iii) shape, location, and size of each of the plurality of objects and a plurality of obstacles in the scene of interest during movement of the object.
Further, the method includes computing, by the high-level planner executed by the one or more hardware processors, an optimal collision free trajectory comprising a plurality of waypoints for moving the object from the current position to the goal position traversing through the plurality of obstacles, where the current position corresponds to an initial way point, and the goal position corresponds to a final way point among the plurality of waypoints.
Further, the method includes estimating, by a low-level Reinforcement Learning (RL) planner executed by one or more hardware processors of the robotic agent, utilizing an object centric action space, object dynamics in terms of a push and a push-direction required to move the object from the initial waypoint to a successive waypoint among the plurality of way points, wherein the robotic agent maneuvers a robotic arm in accordance with the estimated push and push direction providing non-prehensile actions to move the object to the successive way point executing an immediate short goal, and wherein the low-level RL planner is trained using a context agnostic training with collision problem solved by the optimal collision free trajectory computed by the high-level planner.
Furthermore the method includes iteratively estimating by the low-level RL planner the push and the push direction successively towards each of the plurality of way points until the object reaches the final way point at the goal position.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1A is a functional block diagram of a system, also referred to as a robotic agent, for modular framework for object rearrangement by a robotic agent in complex scenes using non-prehensile actions, in accordance with some embodiments of the present disclosure.
FIG. 1B illustrates an architectural overview of the system of FIG. 1A, in accordance with some embodiments of the present disclosure.
FIG. 1C illustrates a task performed by the system to move an object from start to goal position in a cluttered environment using non-prehensile actions, in accordance with some embodiments of the present disclosure.
FIG. 2 is a flow diagram illustrating a method for modular framework for object rearrangement by a robotic agent in complex scenes using non-prehensile actions using the system depicted in FIG. 1A and 1B, in accordance with some embodiments of the present disclosure.
FIGS. 3A and 3B depict a comparison between a global Reinforcement Learning (RL) planner and the modular framework process flow for object rearrangement, in accordance with some embodiments of the present disclosure.
FIG. 4 depicts a comparison between the global RL planner and the modular framework approach for object rearrangement based on experimental results, in accordance with some embodiments of the present disclosure.
FIG. 5A through 5D depict a graphical comparison between trajectories predicted by the global RL planner and the modular framework approach for object rearrangement based on experimental results, in accordance with some embodiments of the present disclosure.
It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems and devices embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
Non-prehensile action is an important aspect of object manipulation, for instance, to manipulate heavy objects or objects too wide to grasp. Push-based non-prehensile manipulation can be divided into push-to-grasp, such as pushing objects in clutter to make them graspable or sliding an object to the edge of the table, and push-to-goal, i.e., pushing an object from a start to a goal position. The push-to-goal work can further be classified as contact-rich manipulation, either by sliding from the top or by the side.
Reinforcement Learning (RL)-based methods have also shown promising results in performing non-prehensile object manipulation tasks. For instance, one of the prior works in literature proposes a hierarchical RL framework that combines the advantages of both model-based and model-free methods for non-prehensile object manipulation. However, in this work, the high-level planner is coupled with the low-level planner and object dynamics, which could lead to noisy estimates when applied to long-distance planning. In such a case, the plan has to be recalculated at an appropriate frequency to reduce noise. This would be computationally expensive.
Embodiments herein provide a method and system providing a modular framework for object rearrangement by a robotic agent in complex scenes using non-prehensile actions. The method disclosed addresses push-to-goal through push-by-striking manipulation. Push-by-striking drops the contact-rich assumption and thus disentangles the dynamics of the object from the manipulator dynamics, removing the constraints on the type of end-effector: as the end-effector strikes an object, the outcome of the object state is less susceptible to the end-effector state or type. To this end, the method utilizes a high-level planner for collision-free trajectory computation to achieve the end-to-end goal, and a collision-agnostic low-level Deep Reinforcement Learning (DRL) module with the shorter goal of reaching successive points of the collision-free trajectory received as input.
One of the recent approaches, Learning Goal-Oriented Non-Prehensile Pushing in Cluttered Scenes by Nils Dengler et al., proposes a modular approach but is unable to fully leverage the advantages of this modularity. The design of the modules used by the above prior art requires a higher training time and is usually more unstable to train.
The prior work by Nils Dengler et al. has a collision-aware RL module (not context agnostic), and hence takes images as input, whereas the low-level RL planner of the method disclosed herein has the object position and orientation as inputs. As a result, the method disclosed is able to create a compact state representation, which makes the low-level RL efficient by reducing the observation space in a meaningful way.
Furthermore, the prior work's RL module is end-effector-centric. That is, the RL module predicts actions in the end-effector space as (Δx, Δy, Δθ), and moves the end-effector accordingly. Only when the end-effector interacts with the object does the RL module of the above prior art receive the reward. Since the frequency of interaction is much less than the frequency of action prediction, the RL agent gets very little feedback from the environment, thus leading to an unstable and long training phase.
The low-level RL planner disclosed herein is designed to be object-centric: it predicts the direction of push for the given object at every time step. In this case, the frequency of interaction is equal to the frequency of action prediction, leading to a high feedback rate from the environment (as every interaction leads to a reward), and hence more stable and faster training. The object centric approach is possible because of the modular framework disclosed by the method, wherein the high-level planner already takes care of collisions, enabling the low-level planner to focus on learning to interact with and push the object to intermediate waypoints. Thus, the method disclosed herein leverages modularity to the full extent by making the low-level RL controller object centric, as opposed to the prior work, leading to a much more stable and faster training phase.
The modular framework enables disentangling a large problem into a set of small independent problems, which makes it easy for RL to converge, due to the smaller state space that comes with the reduced problem size. The high-level planner, which is the global end-to-end path planning module, generates a feasible trajectory for the robotic agent's (robot's) end-effector, while the low-level RL controller learns to control the robot's movements to achieve the desired task. In the modular framework disclosed herein, the low-level RL planner acts as an efficient mechanism for understanding the dynamics of an object under different push actions. Here, the RL controller is trained using a simple objective: to push the object quickly to a given goal location. In this setting, the low-level controller is unaware of the collision objects and is dependent on a high-level planning module to obtain an optimal collision-free trajectory. Compared to training a global RL model for non-prehensile object manipulation, the modular framework can have several advantages:
(1) Improved sample efficiency: Non-prehensile object manipulation tasks typically require a large number of samples to train an RL model effectively. A modular framework can reduce the stress on the RL model by reducing the size of the problem it has to tackle.
(2) Better interpretability: A modular approach separates the control of the task-specific plan and the low-level actions, making it easier to understand how the system operates and to diagnose issues. In contrast, a global RL model can be more opaque and more difficult to interpret.
(3) Ability to handle constraints: Non-prehensile object manipulation tasks often have constraints, such as avoiding collisions or maintaining balance. A modular approach can incorporate these constraints into the high-level planning module and use them to guide the low-level actions. In contrast, a global RL model may struggle to handle all the constraints effectively, as designing an appropriate reward can be tricky.
Referring now to the drawings, and more particularly to FIGS. 1A through 5D, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1A is a functional block diagram of a system 100 providing modular framework for object rearrangement by a robotic agent in complex scenes using non-prehensile actions, in accordance with some embodiments of the present disclosure.
In an embodiment, the system 100, also referred to as a robotic agent, includes a processor(s) 104, communication interface device(s), alternatively referred as input/output (I/O) interface(s) 106, and one or more data storage devices or a memory 102 operatively coupled to the processor(s) 104. The system 100 with one or more hardware processors is configured to execute functions of one or more functional blocks of the system 100.
Referring to the components of system 100, in an embodiment, the processor(s) 104, can be one or more hardware processors 104. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In an embodiment, the system 100 can be implemented in a variety of computing systems including laptop computers, notebooks, hand-held devices such as mobile phones, workstations, mainframe computers, servers, and the like that can control a robotic arm for object movement.
The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, an ego view camera to capture the environment view and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular and the like. In an embodiment, the I/O interface (s) 106 can include one or more ports for connecting to a number of external devices or to another server or devices.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
In an embodiment, the memory 102 includes a plurality of modules 110 such as a high-level planner, a low-level RL planner, and the like. Further, the plurality of modules 110 include programs or coded instructions that supplement applications or functions performed by the system 100 for executing different steps involved in the process of object rearrangement by a robotic agent in complex scenes using non-prehensile actions, being performed by the system 100. The plurality of modules 110, amongst other things, can include routines, programs, objects, components, and data structures, which perform particular tasks or implement particular abstract data types. The plurality of modules 110 may also be used as signal processor(s), node machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules 110 can be used by hardware, by computer-readable instructions executed by the one or more hardware processors 104, or by a combination thereof. The plurality of modules 110 can include various sub-modules (not shown).
Further, the memory 102 may comprise information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure.
Further, the memory 102 includes a database 108. The database (or repository) 108 may include a plurality of abstracted pieces of code for refinement and data that is processed, received, or generated as a result of the execution of the plurality of modules in the module(s) 110.
Although the database 108 is shown internal to the system 100, it is to be noted that, in alternate embodiments, the database 108 can also be implemented external to the system 100, and communicatively coupled to the system 100. The data contained within such an external database may be periodically updated. For example, new data may be added into the database (not shown in FIG. 1A) and/or existing data may be modified and/or non-useful data may be deleted from the database. In one example, the data may be stored in an external system, such as a Lightweight Directory Access Protocol (LDAP) directory and a Relational Database Management System (RDBMS). Functions of the components of the system 100 are now explained with reference to steps in flow diagrams in FIG. 2 through FIG. 5D.
FIG. 1B illustrates an architectural overview of the system 100 of FIG. 1A, in accordance with some embodiments of the present disclosure. The Modular planning framework to solve the task of non-prehensile object manipulation in a cluttered scene utilizes the low-level RL controller that is trained to reach a short goal. The high-level planning module predicts a set of collision-free waypoints for the object. These waypoints are fed as goals to the low-level RL, which then helps find the best push direction to execute.
FIG. 1C illustrates a task performed by the system 100 to move an object from start to goal position in a cluttered environment using non-prehensile actions, in accordance with some embodiments of the present disclosure. The top image of FIG. 1C visualizes the trajectory executed by the high level planner in combination with the local low-level RL planner. The bottom part of the FIG. 1C shows the same trajectory using a graphical representation. The intended trajectory is indicated by two straight lines connecting at a common turning point. The executed trajectory (with triangle symbol) is a path closely following the intended trajectory.
FIG. 2 is a flow diagram illustrating a method 200 for modular framework for object rearrangement by a robotic agent in complex scenes using non-prehensile actions using the system depicted in FIG. 1A and 1B, in accordance with some embodiments of the present disclosure.
In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the processor(s) 104 and is configured to store instructions for execution of steps of the method 200 by the processor(s) or one or more hardware processors 104. The steps of the method 200 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1A and 1B and the steps of flow diagram as depicted in FIG. 2. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
Referring to the steps of the method 200, at step 202 of the method 200, the high-level planner executed by the one or more hardware processors 104 receives environment information comprising (i) a current position of an object among a plurality of objects positioned within a scene of interest, (ii) a goal position of the object, (iii) shape, location, and size of the object and a plurality of obstacles in the scene during movement of the object.
At step 204 of the method 200, the high-level planner executed by the one or more hardware processors 104 computes an optimal collision free trajectory comprising a plurality of waypoints for moving the object from the current position to the goal position traversing through the obstacles. The current position corresponds to an initial way point, and the goal position corresponds to a final way point among the plurality of waypoints. Thus, the high-level planner is aware of the context or environment in which the task is to be performed. In an embodiment, the high-level planner is based on the A* graph traversal technique, wherein the plurality of way points correspond to a plurality of nodes of a graph for the A* graph traversal technique. The heuristic function used by the A* graph traversal technique is the Euclidean distance to the goal position. The edge cost between two nodes among the plurality of nodes is the sum of the Euclidean distance between them and a collision cost.
At step 206 of the method 200, the low-level RL planner executed by the one or more hardware processors 104 estimates, utilizing an object centric action space, object dynamics in terms of a push and a push-direction required to move the object from the initial waypoint to a successive waypoint among the plurality of way points. The robotic agent maneuvers a robotic arm in accordance with the estimated push and push-direction providing non-prehensile actions to move the object to the successive way point executing an immediate short goal. The low-level RL planner is trained using context agnostic training, with the collision problem solved by the optimal collision free trajectory computed by the high-level planner. The low-level RL planner is a three-layered Multilayer Perceptron (MLP) with 128 units per layer, wherein the first two layers are activated by a Rectified Linear Unit (ReLU) activation function, while the third layer does not use an activation function. The low-level RL planner uses a Goal Distance Reward (GDR), which is the difference between the previous distance to the goal position and the current distance to the goal position, and a Goal Reward (GR) given when the object reaches the goal position.
At step 208 of the method 200, the low-level planner executed by the one or more hardware processors 104 iteratively estimates the push and the push direction successively towards each of the plurality of way points until the object reaches the final way point at the goal position.
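For illustration, the following is a minimal Python sketch of the control loop formed by steps 202 through 208. The helper callables plan_waypoints (high-level planner), predict_push (low-level RL policy), and execute_push (robot/simulator interface), as well as the 0.05 m waypoint tolerance, are hypothetical placeholders and not part of the disclosure.

```python
import numpy as np

def rearrange_object(obj_pose, goal_pose, obstacles, plan_waypoints,
                     predict_push, execute_push, tol=0.05):
    """Modular control loop: A* waypoints guide a waypoint-to-waypoint RL pusher.

    plan_waypoints, predict_push and execute_push are hypothetical callables
    standing in for the high-level planner, the low-level RL policy and the
    robot/simulator interface, respectively.
    """
    waypoints = plan_waypoints(obj_pose, goal_pose, obstacles)   # step 204
    for waypoint in waypoints[1:]:             # waypoints[0] is the current pose
        while np.linalg.norm(obj_pose[:2] - waypoint[:2]) > tol:
            action = predict_push(obj_pose, waypoint)   # step 206: push direction
            obj_pose = execute_push(action, obj_pose)   # non-prehensile strike
    return obj_pose
```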
Mathematical representation and detailed explanation of the method 200
To showcase the enhanced performance of the modular approach, an additional end-to-end global planner is built and compared with the modular approach.
The method 200 and the global planner are now explained with mathematical models for the Machine Learning (ML) models, which include the high-level planner and the low-level RL planner, and the corresponding model training with the reward functions used.
A. Task Setup: The environment is a planar surface of dimensions 0.5 m × 0.5 m, upon which two movable collision objects are randomly placed. The target object is a rectangular block with varying dimensions, where length, breadth ∈ [0.08, 0.12] and height ∈ [0.02, 0.05], and the robot manipulator is a Universal Robots (UR5)™ arm with a closed two-fingered gripper. The manipulator interacts with the environment by striking the target object using the tip of the arm. The goal is specified as a 3D pose of the pushed object.
B. Framework 1: Global RL: The global RL planner is a closed-loop goal-conditioned planner, which learns to push an object from a given initial position to a final desired position in a collision-free way. This is defined as a Markov Decision Process (MDP) with states s_t ∈ S, actions a_t ∈ A, goals g ∈ G, and reward function r : S × A → R. The state space s_t = (h_t, l_t) consists of a height map h_t of the scene and a three-dimensional vector l_t = (x_c, y_c, yaw_c), denoting the initial position of the object that needs to be pushed, with respect to the global frame. The action space consists of 16 discrete actions, where the i-th action denotes a push vector of fixed length in free space, which makes an angle of 2πi/16 with the x-axis of the global frame. This policy is trained on cuboidal objects, where each cuboid is represented as a set of 8 points (4 corners and 4 mid-points). When a push vector is chosen, the point among the 8 points which offers the lowest torque about the object's center of mass when pushed in that direction is chosen as the start point for pushing. A goal g ∈ G is defined by a 3-dimensional vector that denotes the goal pose (x_g, y_g, yaw_g) of the object with respect to the global frame.
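As an illustrative sketch of the object-centric action space described above, the following Python snippet maps a discrete action index to a push vector at an angle of 2πi/16 and selects, among the 8 candidate points of the cuboid, the contact point with the lowest torque about the object's center of mass. The fixed push length of 0.05 m is an assumed value not specified in the disclosure.

```python
import numpy as np

def action_to_push(action_idx, contact_points, com, push_len=0.05, n_actions=16):
    """Map a discrete action index to a push vector and a contact point.

    contact_points: (8, 2) array of the cuboid's 4 corners and 4 edge midpoints.
    com: (2,) centre of mass of the object.
    push_len (0.05 m) is an assumed fixed push length, not taken from the source.
    """
    theta = 2.0 * np.pi * action_idx / n_actions          # angle with global x-axis
    push_dir = np.array([np.cos(theta), np.sin(theta)])
    push_vec = push_len * push_dir
    # 2D torque magnitude about the COM when pushing along push_dir at each point
    r = contact_points - com
    torques = np.abs(r[:, 0] * push_dir[1] - r[:, 1] * push_dir[0])
    start_point = contact_points[np.argmin(torques)]      # lowest-torque contact
    return start_point, push_vec
```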
The objective is to find a goal-conditioned policy π(a_t | s_t, g) that maximizes the return R_t = Σ_{k=t}^{∞} γ^{k−t} r(s_k, a_k). An optimal policy is found by modeling the Q function using a Deep Q-learning Network (DQN). Since the policy is goal-conditioned, the Q-function must also be goal-conditioned, given by Q^π(s_t, a_t, g). Therefore, the DQN takes both the state s_t and goal g as input and returns the Q-value for each of the 16 actions. The policy is then given by:
π(s_t, g) = argmax_{a_t} Q(s_t, a_t, g)    (1)
FIG. 3A illustrates the architecture of the global planner consisting of three main modules: CNN1, MLP1, and MLP2. The height map of dimensions (224, 224, 1) is first passed to CNN1, a three-layered convolutional network, which processes it and outputs a (13, 13, 64) map. The MLP1 module, which is a two-layered MLP, takes a 6-dimensional (l_t, g) = (x_c, y_c, yaw_c, x_g, y_g, yaw_g) as input and gives a 16-dimensional output. The output from CNN1 is flattened and appended to the output from MLP1 to form a unified 10832-dimensional vector, which is then passed into MLP2, a three-layered MLP, to get a 16-dimensional vector as output, which denotes the Q-values for each of the 16 actions.
Each convolutional layer in CNN1 is followed by batch normalization, activation by the ReLU function, and max pooling. All the linear layers in MLP1 and MLP2, except for the output layer, also use ReLU as their activation function.
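A possible realization of this global DQN in PyTorch is sketched below for illustration. Only the stated interface (a 224×224×1 height map and a 6-dimensional (l_t, g) vector as inputs, 16 Q-values as output) and the CNN1/MLP1/MLP2 decomposition are taken from the description; the kernel sizes, strides, and hidden widths are assumptions, the last max-pool is omitted so the stated 13×13×64 map is reproduced, and the flattened feature size is inferred at construction time.

```python
import torch
import torch.nn as nn

class GlobalDQN(nn.Module):
    """Illustrative goal-conditioned DQN for the global RL planner.

    Only the interface follows the specification: a (1, 224, 224) height map
    plus a 6-D (l_t, g) vector in, 16 Q-values out, with a 13x13x64 CNN feature
    map fused into a 10832-D vector. Layer hyperparameters are assumptions.
    """

    def __init__(self, n_actions: int = 16):
        super().__init__()
        self.cnn1 = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5, stride=2), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        )
        with torch.no_grad():                      # infer the flattened size (10816)
            n_flat = self.cnn1(torch.zeros(1, 1, 224, 224)).flatten(1).shape[1]
        self.mlp1 = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 16))
        self.mlp2 = nn.Sequential(
            nn.Linear(n_flat + 16, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, height_map, pose_and_goal):
        feat = self.cnn1(height_map).flatten(1)                  # (B, 10816)
        fused = torch.cat([feat, self.mlp1(pose_and_goal)], 1)   # (B, 10832)
        return self.mlp2(fused)                                  # Q-value per push action
```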
A Huber loss is used to train the DQN. The double DQN algorithm known in the art is followed, where two identical networks are used, one for collecting the data (target network) and the other for the training (policy network). Instead of a hard-update rule, a soft-update rule is used, as in Deep Deterministic Policy Gradient (DDPG), given by:
θ_target ← τ · θ_policy + (1 − τ) · θ_target    (2)
where θ_target denotes the parameters of the target network, θ_policy denotes the parameters of the policy network, and τ is set to 0.005 to reduce the variance in training data.
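A minimal sketch of the soft-update rule of equation (2), assuming PyTorch networks, is given below.

```python
import torch

@torch.no_grad()
def soft_update(target_net, policy_net, tau=0.005):
    """Soft target-network update of equation (2): θ_target ← τ·θ_policy + (1−τ)·θ_target."""
    for t_param, p_param in zip(target_net.parameters(), policy_net.parameters()):
        t_param.mul_(1.0 - tau).add_(tau * p_param)
```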
The reward function consists of the following four components:
Goal Distance Reward (GDR): This is the difference between the previous distance to the goal and the current distance to the goal, given by:
GDR(s_n, s_{n−1}, g) = D2G(s_{n−1}, g) − D2G(s_n, g)    (3)
Here, D2G(s_k, g) = ‖l_k(0 : 2) − g(0 : 2)‖ denotes the distance to the goal in state s_k.
Collision Reward (CR): This is the difference between the obstacle positions before and after the application of the push action, given by:
CR(s_{n−1}, s_n) = Σ_i ‖obs_{n−1}^i − obs_n^i‖    (4)
Here, obs_k^i denotes the pose of the i-th obstacle in state s_k.
Goal Reward (GR): This is the reward that is given when the object reaches its goal state. This is given by:
GR(s_n) = +100, if ‖s_n − g‖ ≤ 0.05; 0, otherwise    (5)
Out-of-bounds penalty (OOB): This is the reward that is given when the object goes out of bounds, given by:
OOB(s_n) = −100, if the object goes out of bounds; 0, otherwise    (6)
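For illustration, the four reward components of equations (3) through (6) can be computed as sketched below. The poses are assumed to be 2D numpy arrays; the negative sign on the collision term (penalizing obstacle disturbance) and the simple summation of the four components are assumptions, as the disclosure does not state how the components are combined.

```python
import numpy as np

def global_rl_reward(obj_prev, obj_curr, goal, obs_prev, obs_curr,
                     in_bounds=True, goal_tol=0.05):
    """Reward of equations (3)-(6) for one step of the global RL planner.

    obj_prev/obj_curr/goal: (x, y) of the pushed object; obs_prev/obs_curr:
    lists of obstacle poses before and after the push. The sign convention of
    the collision term and the summation of components are assumptions.
    """
    gdr = np.linalg.norm(obj_prev - goal) - np.linalg.norm(obj_curr - goal)  # (3)
    cr = -sum(np.linalg.norm(p - c) for p, c in zip(obs_prev, obs_curr))     # (4)
    gr = 100.0 if np.linalg.norm(obj_curr - goal) <= goal_tol else 0.0       # (5)
    oob = -100.0 if not in_bounds else 0.0                                   # (6)
    return gdr + cr + gr + oob
```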
This model is then trained on 10000 episodes (about 183000 iterations).
C. Modular Framework (method 200): A* (high-level planner) + Low-level RL: The high-level planner is the path planning module, which generates an optimal collision-free trajectory for the robot's end-effector, while the low-level RL planner or controller controls the robot's movements to accomplish the task objectives. FIG. 1B illustrates the proposed approach.
1) Low-level RL Controller: The low-level RL controller learns to simply push a target object from a start to a goal location on a tabletop scene. The controller is trained without any obstacles on the table and is optimized to approach the goal location quickly. The MDP formulation is identical to the one designed for the global RL planner and uses the same action space. However, the state is simplified from s_t = (h_t, l_t) to s_t = (l_t). The reward function for the low-level RL involves only the Goal Distance Reward (GDR) and the Goal Reward (GR) mentioned in the global RL section. A discount factor γ = 0.9 is used, which instructs the policy to reach the goal as quickly as possible.
A DQN is used in a similar way to the global RL, with a target network for exploration and a policy network for learning from experience. The optimal policy is then given by:
π(s_t, g) = argmax_{a_t} Q(s_t, a_t, g)    (7)
FIG. 3B illustrates the proposed architecture for the low-level RL: a simple three-layered MLP with 128 units per layer. The first two layers are activated by ReLU, while the third layer does not use any activation function. The overall process flow is depicted as explained in FIG. 1B.
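A minimal PyTorch sketch of this low-level Q-network is shown below. The 6-dimensional input (object pose concatenated with the waypoint goal) and the 16-dimensional output head are assumptions consistent with the state and action spaces described above; the third layer is read here as the output head rather than as a 128-unit layer.

```python
import torch.nn as nn

class LowLevelQNet(nn.Module):
    """Object-centric Q-network of the low-level RL planner: three linear layers,
    ReLU on the first two, no activation on the last. The input/output sizes are
    assumptions (object pose + waypoint goal in, one Q-value per push direction out).
    """

    def __init__(self, in_dim: int = 6, n_actions: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),          # no activation on the third layer
        )

    def forward(self, state_and_goal):
        return self.net(state_and_goal)
```

A greedy policy then selects the argmax over the returned Q-values, as in equation (7).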
The high-level planner, implemented using an A* Path Planning Module, generates a feasible minimum-collision trajectory for the pushed object, given the initial and final configurations, along with the shapes, locations, and sizes of the obstacles in the scene. A graph is defined with a minimum distance of 0.05 meters between adjacent nodes, bounded by the limits of the workspace. Every node has 5 equally spaced adjacent neighbors separated by an angle of 360°/5 = 72°. These parameters have been set to improve the speed of convergence for a given scene but can be reset based on the trade-off between accuracy and speed.
The Euclidean distance to the goal is used as the heuristic function. The edge cost between two nodes is the sum of the Euclidean distance between them and the collision cost, if any.
The collision radius R_i^c for collision object i is the maximum distance from its centroid to its boundary added to the maximum distance between the pushed object's centroid and its boundary, and d_i is the maximum distance between the pushed object's centroid and its boundary. The collision cost is mathematically defined as: Collision Cost = Σ_{i=1}^{n} c_i, where c_i = R_i^c − d_i if d_i ≤ R_i^c, and c_i = 0 otherwise,
wherein the collision radius R_i^c for a collision object i is the maximum distance from the centroid of the collision object to the boundary of the collision object added to the maximum distance between the centroid and the boundary of the object being pushed, and wherein d_i is the maximum distance between the centroid and the boundary of the object being pushed. Here, n is the number of collision objects in the scene. Therefore, the method 200 (modular framework) predicts waypoints that reach the goal with minimum distance covered and with the lowest collision rate possible.
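The following Python sketch illustrates the A* edge cost and collision cost described above, together with the neighbor expansion of 5 directions spaced 72° apart at 0.05 m. Representing each collision object by a (centroid, R_i^c) pair and interpreting d_i as the distance from the candidate node to that centroid are assumptions made only to keep the sketch concrete.

```python
import numpy as np

def collision_cost(node_xy, obstacles):
    """Collision cost at a candidate node.

    obstacles: list of (centroid, R_c) pairs, where R_c is the collision radius
    (obstacle half-extent plus pushed-object half-extent). Treating d_i as the
    node-to-centroid distance is an assumption, not stated in the specification.
    """
    cost = 0.0
    for centroid, r_c in obstacles:
        d_i = np.linalg.norm(node_xy - centroid)
        if d_i <= r_c:
            cost += r_c - d_i
    return cost

def edge_cost(a_xy, b_xy, obstacles):
    """Edge cost = Euclidean distance + collision cost at the destination node."""
    return np.linalg.norm(a_xy - b_xy) + collision_cost(b_xy, obstacles)

def neighbors(node_xy, step=0.05, n_dirs=5):
    """Five equally spaced neighbours (72 degrees apart) at 0.05 m spacing."""
    angles = 2.0 * np.pi * np.arange(n_dirs) / n_dirs
    return [node_xy + step * np.array([np.cos(t), np.sin(t)]) for t in angles]
```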
EXPERIMENTS: A simulation environment was built for the task using PyBullet™, as shown in FIG. 4. The environment contained a UR5™ arm with a closed two-fingered gripper, a table, and a set of objects on top of the table. The focus was on pushing standard cuboidal objects, with obstacles in three different shapes: cubes, cuboids, and cylinders, placed in valid randomized locations. The global RL planner is the DQN trained in simulation for 10000 episodes, which is equivalent to 183000 environment steps/iterations. The low-level RL planner, with a zero discount factor, is also a DQN, which was trained on 500 episodes, which is equivalent to 5350 environment steps/iterations. (Ablation: the low-level DQN with a non-zero discount factor requires more training time and iterations, trained for 120k iterations, and also performs well.) The training was done on a single NVIDIA GeForce GTX 1080 Ti GPU with 10 CPUs. 12 distinct experiments and complex scenes were designed to evaluate the performance of these models. These scenes are shown in FIG. 4.
Metrics: For performance evaluation, the focus was on collisions and the success rate; therefore, three metrics are defined: Average Positional Disturbance, Average Angular Disturbance, and Success Rate.
Average Positional Disturbance (APD): For a given scene, let pose_init(i) be the initial 2D cartesian pose of the object i, and pose_final(i) be the final 2D cartesian pose calculated at the end of the episode. Then, the average positional disturbance for a given scene j is calculated as follows: APD(j) = Σ_i ‖pose_final(i) − pose_init(i)‖
The average of APD over a set of scenes is then computed and reported as Average APD.
Average Angular Disturbance (AAD): For a given scene, let yaw_init(i) be the initial yaw of the object i, and yaw_final(i) be the final yaw calculated at the end of the episode. Then, the average angular disturbance for a given scene j is calculated as follows: AAD(j) = Σ_i |yaw_final(i) − yaw_init(i)|
The average of AAD is then calculated over a set of scenes and reported as Average AAD.
Success Rate: If the concerned object reaches its goal without going out of bounds, then, it is counted as a success. The success rate is reported as the number of successful episodes divided by the total number of episodes.
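For illustration, the per-scene metrics can be computed as sketched below, following the definitions above; the helper name scene_metrics and the input format are assumptions.

```python
import numpy as np

def scene_metrics(init_poses, final_poses, init_yaws, final_yaws,
                  reached_goal, out_of_bounds):
    """APD, AAD and success flag for one scene/episode, per the definitions above."""
    apd = sum(np.linalg.norm(f - i) for i, f in zip(init_poses, final_poses))
    aad = sum(abs(f - i) for i, f in zip(init_yaws, final_yaws))
    success = bool(reached_goal and not out_of_bounds)
    return apd, aad, success

# Averages over a set of scenes are then reported as Average APD, Average AAD,
# and the success rate (fraction of successful episodes).
```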
The global RL framework and the A* + low-level RL framework are compared in terms of Success Rate, Average APD across all 12 experiments, and Average AAD across all experiments.
Qualitative Analysis: FIGS. 5A and 5C display the A* path. The shape of the A* trajectory is a consequence of collision constraints between the pushed object and the obstacles on the table. The low-level RL planning controller effectively moves the item to the objective (current position to the goal position) without colliding by following the path fed by A*, as shown by the trajectory with triangular symbols in FIGS. 5A and 5C, whereas FIGS. 5B and 5D depict the Global RL trajectory for the same arrangement of start (current position), goal position, and obstacle poses. While the global RL seems to compute shorter plans to the goal, there are collisions between the pushed object and other obstacles on the table. Thus, A* + low-level RL is seen to perform better than global RL, despite being trained for a fraction of global RL’s training time.
Trajectories predicted by A* (the high-level planner) and executed by the low-level RL planner are directly compared against trajectories executed by the global RL. Here, the solid fill shapes denote obstacles, and the outlines denote the positions and orientations of the obstacles (the outlines are visible when obstacles are displaced from their original position due to collision while the object is pushed from the current position to the goal position).
The collision aware A* (high-level planner) is able to always find the trajectory with minimum collision cost in FIG. 5A and 5C and the low-level RL planner is shown to fit well into this trajectory. On the other hand, Global RL finds a trajectory that may reach the goal faster as in FIG. 5B and 5D but has a higher collision rate, as shown qualitatively by the outlines denoting the displacement of the collision objects.
Quantitative Analysis: The performance of the global planner against the modular framework is depicted using the following three metrics: the Average APD, the Average AAD, and the success rate. Table 1 shows the metrics obtained for the global RL and the A* + low-level RL, by running them on the 12 designed scenes shown in FIG. 4. As seen, the A* + low-level RL outperforms the global RL on all the metrics, despite being trained for just a few hours, as opposed to the global RL, which was trained for approximately four days.
Table 1:
| Framework | Success Rate (%) | Avg APD (×10⁻³) | Avg AAD |
|---|---|---|---|
| Global RL | 62.5 | 53.07 | 0.53 |
| A* + Low-level RL | 100 | 6.06 | 0.49 |
Thus, the method and system disclosed herein disentangle the end-to-end problem of moving an object from the current position to the goal position without collision with obstacles into two smaller problems addressed by two independent modules, A* (the high-level planner) and the low-level RL planner, and indicate that such a disentanglement helps speed up training and adaptation to new contexts. Disentangling the collision-free path planning and object control into two separate modules makes the task easy for the RL model. Thus, a performance better than the global RL is achieved with just a few hours of training. Such a disentanglement also makes the model more transparent and interpretable. As future scope, replacing the A* with a learning-based module that takes the behavior of the low-level RL into account while planning would form an important direction, as this would move towards interpretable and transparent entanglement, which combines the advantages of the global RL and the disentangled framework.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means, and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
CLAIMS:
1. A processor implemented method for non-prehensile object rearrangement, the method comprising:
receiving, by a high-level planner executed by one or more hardware processors of a robotic agent, an environment information comprising (i) a current position of an object among a plurality of objects positioned within a scene of interest, (ii) a goal position of each of the plurality of objects, (iii) shape, location, and size of each of the plurality of objects and a plurality of obstacles in the scene of interest during movement of the object;
computing, by the high-level planner executed by the one or more hardware processors, an optimal collision free trajectory comprising a plurality of waypoints for moving the object from the current position to the goal position traversing through the plurality of obstacles, where the current position corresponds to an initial way point, and the goal position corresponds to a final way point among the plurality of waypoints;
estimating, by a low-level Reinforcement Learning (RL) planner executed by one or more hardware processors of the robotic agent, utilizing an object centric action space, object dynamics in terms of a push and a push-direction required to move the object from the initial waypoint to a successive waypoint among the plurality of way points, wherein the robotic agent maneuvers a robotic arm in accordance with the estimated push and push direction providing non-prehensile actions to move the object to the successive way point executing an immediate short goal, and wherein the low-level RL planner is trained using a context agnostic training with collision problem solved by the optimal collision free trajectory computed by the high-level planner; and
iteratively estimating by the low-level RL planner the push and the push direction successively towards each of the plurality of way points until the object reaches the final way point at the goal position.
2. The method as claimed in claim 1,
wherein the high-level planner is based on A* graph traversal technique, wherein the plurality of way points correspond to a plurality of nodes of a graph for the A* graph traversal technique,
wherein a heuristic function used by the A* graph traversal technique is Euclidean distance to the goal position, and
wherein an edge cost between two nodes among the plurality of nodes is a sum of the Euclidean distance between them and a collision cost.
3. The method as claimed in claim 2, wherein the collision cost is mathematically defined as: Collision Cost = Σ_{i=1}^{n} c_i, where c_i = R_i^c − d_i if d_i ≤ R_i^c, and c_i = 0 otherwise,
wherein a collision radius R_i^c for a collision object i is a maximum distance from the centroid of the collision object to a boundary of the collision object added to a maximum distance between the centroid and the boundary of the object being pushed, and wherein d_i is a maximum distance between the centroid and the boundary of the object being pushed.
4. The method as claimed in claim 1, wherein the low-level RL planner is a three-layered Multilayer Perceptron (MLP), with 128 units per layer, wherein the first two layers are activated by a Rectified Linear Unit (ReLU) activation function, while the third layer does not use an activation function.
5. The method as claimed in claim 4, wherein the low-level RL planner uses a Goal Distance Reward (GDR) which is a difference between a previous distance to the goal position and a current distance to the goal position, and a Goal Reward (GR) given when the object reaches the goal position.
6. A system (100) for non-prehensile object rearrangement, the system (100) comprising:
a memory (102) storing instructions;
one or more Input/Output (I/O) interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more I/O interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive, by a high-level planner executed by one or more hardware processors of a robotic agent, an environment information comprising (i) a current position of an object among a plurality of objects positioned within a scene of interest, (ii) a goal position of each of the plurality of objects, (iii) shape, location, and size of each of the plurality of objects and a plurality of obstacles in the scene of interest during movement of the object;
compute, by the high-level planner executed by the one or more hardware processors, an optimal collision free trajectory comprising a plurality of waypoints for moving the object from the current position to the goal position traversing through the plurality of obstacles, where the current position corresponds to an initial way point, and the goal position corresponds to a final way point among the plurality of waypoints;
estimate, by a low-level Reinforcement Learning (RL) planner executed by one or more hardware processors of the robotic agent, utilizing an object centric action space, object dynamics in terms of a push and a push-direction required to move the object from the initial waypoint to a successive waypoint among the plurality of way points, wherein the robotic agent maneuvers a robotic arm in accordance with the estimated push and push direction providing non-prehensile actions to move the object to the successive way point executing an immediate short goal, and wherein the low-level RL planner is trained using a context agnostic training with collision problem solved by the optimal collision free trajectory computed by the high-level planner; and
iteratively estimate by the low-level RL planner the push and the push direction successively towards each of the plurality of way points until the object reaches the final way point at the goal position.
7. The system as claimed in claim 6,
wherein the high-level planner is based on A* graph traversal technique, wherein the plurality of way points correspond to a plurality of nodes of a graph for the A* graph traversal technique,
wherein a heuristic function used by the A* graph traversal technique is Euclidean distance to the goal position, and
wherein an edge cost between two nodes among the plurality of nodes is a sum of the Euclidean distance between them and a collision cost.
8. The system as claimed in claim 7, wherein the collision cost is mathematically defined as: Collision Cost = Σ_{i=1}^{n} c_i, where c_i = R_i^c − d_i if d_i ≤ R_i^c, and c_i = 0 otherwise,
wherein a collision radius R_i^c for a collision object i is a maximum distance from the centroid of the collision object to a boundary of the collision object added to a maximum distance between the centroid and the boundary of the object being pushed, and wherein d_i is a maximum distance between the centroid and the boundary of the object being pushed.
9. The system as claimed in claim 6, wherein the low-level RL planner is a three-layered Multilayer Perceptron (MLP), with 128 units per layer, wherein the first two layers are activated by a Rectified Linear Unit (ReLU) activation function, while the third layer does not use an activation function.
10. The system as claimed in claim 9, wherein the low-level RL planner uses a Goal Distance Reward (GDR) which is a difference between a previous distance to the goal position and a current distance to the goal position, and a Goal Reward (GR) given when the object reaches the goal position.