
Method And System For Planning A Large Scale Object Rearrangement Using Deep Reinforcement Learning

Abstract: This disclosure relates generally to a method and system for planning a large-scale object rearrangement using deep reinforcement learning. Rearranging objects to clean an environment is a challenging problem in the field of robotics. Object rearrangement is about moving one or more objects from an initial state to a goal state through task and motion planning. The system receives a source plan and a target plan to replace the current state of each source object with the corresponding goal state of each target object. Further, a trained parameterized deep Q network generates a task plan, executed by an RL agent, for achieving the goal state of each target object position by constructing a discrete action and a continuous action to reach the goal state of each target object from the current position. Further, the RL agent is assigned a reward for every action performed in the rearrangement environment during training. [To be published with FIG. 3]


Patent Information

Application #
Filing Date
19 July 2022
Publication Number
04/2024
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
Parent Application

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th floor, Nariman point, Mumbai 400021, Maharashtra, India

Inventors

1. DAS, Dipanjan
Tata Consultancy Services Limited, Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata 700160, West Bengal, India
2. BHOWMICK, Brojeshwar
Tata Consultancy Services Limited, Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata 700160, West Bengal, India
3. GHOSH, Sourav
Tata Consultancy Services Limited, Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata 700160, West Bengal, India
4. AGARWAL, Marichi
Tata Consultancy Services Limited 4 & 5th floor, PTI Building, No 4, Sansad Marg, New Delhi 110001, Delhi, India
5. CHAKRABORTY, Abhishek
Tata Consultancy Services Limited, Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata 700160, West Bengal, India

Specification

Description: FORM 2

THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of invention:

METHOD AND SYSTEM FOR PLANNING A LARGE-SCALE OBJECT REARRANGEMENT USING DEEP REINFORCEMENT LEARNING

Applicant

Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India

Preamble to the description:
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
The disclosure herein generally relates to object rearrangement, and, more particularly, to a method and system for planning a large-scale object rearrangement using deep reinforcement learning.

BACKGROUND
In recent trends, embodied artificial intelligence (AI) has shown immense progress in developing models and algorithms that enable a reinforcement learning (RL) agent to navigate and perform task(s) within an environment. However, a typical assumption of such task(s) is a static environment, where the RL agent moves within the environment but cannot interact with one or more objects or modify their state. The ability to interact with each object and change the environment is crucial for any artificial embodied RL agent.
Automatic rearrangement of objects to clean up a room is a challenging problem in the field of robotics. Object rearrangement is about moving one or more objects from an initial state to a goal state through task and motion planning. Rearrangement requires understanding the current state of each object, computing the difference between the current state and the goal state, and then planning a sequence of actions so that each object reaches its specified goal state. For example, the goal state can be given through images, through descriptions using language, or through formal predicate-based specifications.
In one existing method, rearrangement of one or more objects in a room based on datasets and a baseline model instantiates the rearrangement problem using an image goal specification. However, the said method falls short when rearranging a large number of objects in two-dimensional (2D) tabletop and three-dimensional (3D) room scenarios with multiple blocked goals, which require selecting and placing actions to perform rearrangement of the one or more objects.
In another existing method, task and motion planning used a high-level task planner combined with a local motion planner, creating a single unified formulation using classical sampling-based rapidly-exploring random tree (RRT) algorithms or using integer programming to solve an optimization instance. Such methods fail to scale to large object rearrangements, as the action sequence grows exponentially with the number of objects, and they do not produce solutions for collision instances without requiring an additional buffer space.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a system for planning a large-scale object rearrangement using deep reinforcement learning is provided. The system includes receiving by a trained reinforcement learning (RL) agent placed in a rearrangement environment, a source plan, and a target plan, wherein the source plan comprises of a current state of one or more source objects positioned in a three-dimensional (3D) space of the rearrangement environment, wherein the target plan comprises of a goal state of one or more target objects to be rearranged in the 3D space of the rearrangement environment. Further, the RL agent locates the current state of each source object and the goal state of each target object and predicts an action sequence to replace the current state of each source object with the corresponding goal state of each target object. Further, a task plan is generated by utilizing a trained parameterized deep Q network, executed by the RL agent to rearrange the current state of each object for achieving the goal state of each target object position. The task plan is generated by constructing a discrete action and a continuous action to reach the goal state of each target object from the current position of the RL agent. The discrete action selects each source object from the current state blocking the goal state of the corresponding target object and places the selected object in a dynamic intermediate buffer space. The continuous action traces a destination coordinate from the current position of the RL agent to reach the goal state of each target object. The Q network of the parameterized deep Q network decides the blocked object to be selected from the current state. The parameter network P of the parameterized deep Q network predicts a continuous coordinate where the selected source object is to be placed in the dynamic intermediate buffer space. Furthermore, the RL agent is assigned a reward based on a reward structure for every action performed by the RL agent in the rearrangement environment. The reward structure comprises of an infeasible action reward, a static object reward, a collision resolution reward, a nearest neighbour reward, and a goal state reaching reward.
In accordance with an embodiment, the trained parameterized deep Q network enables the RL agent to perform action sequence on each target object location preoccupied by at least one of each source object and each target object present in a collided rearrangement environment and a non-collided rearrangement environment.
In accordance with an embodiment, the dynamic intermediate buffer space is predicted at runtime for positioning and swapping the current state of each source object with the corresponding goal state of each target object for avoiding collision occurrence.
In another aspect, a method for planning a large-scale object rearrangement using deep reinforcement learning is provided. The method includes receiving by a trained reinforcement learning (RL) agent placed in a rearrangement environment, a source plan, and a target plan, wherein the source plan comprises of a current state of one or more source objects positioned in a three-dimensional (3D) space of the rearrangement environment, wherein the target plan comprises of a goal state of one or more target objects to be rearranged in the 3D space of the rearrangement environment. Further, the RL agent locates the current state of each source object and the goal state of each target object and predicts an action sequence to replace the current state of each source object with the corresponding goal state of each target object. Further, a task plan is generated by utilizing a trained parameterized deep Q network, executed by the RL agent to rearrange the current state of each object to achieve the goal state of each target object position. The task plan is generated by constructing a discrete action and a continuous action to reach the goal state of each target object from the current position of the RL agent. The discrete action selects each source object from the current state blocking the goal state of the corresponding target object and places the selected object in a dynamic intermediate buffer space. The continuous action traces a destination coordinate from the current position of the RL agent to reach the goal state of each target object. The Q network of the parameterized deep Q network decides the blocked object to be selected from the current state. The parameter network P of the parameterized deep Q network predicts a continuous coordinate where the selected source object is to be placed in the dynamic intermediate buffer space. Furthermore, the RL agent is assigned a reward based on a reward structure for every action performed by the RL agent in the rearrangement environment. The reward structure comprises of an infeasible action reward, a static object reward, a collision resolution reward, a nearest neighbour reward, and a goal state reaching reward.
In accordance with an embodiment, the trained parameterized deep Q network enables the RL agent to perform action sequence on each target object location preoccupied by at least one of each source object and each target object present in a collided rearrangement environment and a non-collided rearrangement environment.
In accordance with an embodiment, the dynamic intermediate buffer space is predicted at runtime for positioning and swapping the current state of each source object with the corresponding goal state of each target object for avoiding collision occurrence.
In yet another aspect, a non-transitory computer readable medium for receiving by a trained reinforcement learning (RL) agent placed in a rearrangement environment, a source plan, and a target plan, wherein the source plan comprises of a current state of one or more source objects positioned in a three-dimensional (3D) space of the rearrangement environment, wherein the target plan comprises of a goal state of one or more target objects to be rearranged in the 3D space of the rearrangement environment. Further, the RL agent locates the current state of each source object and the goal state of each target object and predicts an action sequence to replace the current state of each source object with the corresponding goal state of each target object. Further, a task plan is generated by utilizing a trained parameterized deep Q network, executed by the RL agent to rearrange the current state of each object to achieve the goal state of each target object position. The task plan is generated by constructing a discrete action and a continuous action to reach the goal state of each target object from the current position of the RL agent. The discrete action selects each source object from the current state blocking the goal state of the corresponding target object and places the selected object in a dynamic intermediate buffer space. The continuous action traces a destination coordinate from the current position of the RL agent to reach the goal state of each target object. The Q network of the parameterized deep Q network decides the blocked object to be selected from the current state. The parameter network P of the parameterized deep Q network predicts a continuous coordinate where the selected source object is to be placed in the dynamic intermediate buffer space. Furthermore, the RL agent is assigned a reward based on a reward structure for every action performed by the RL agent in the rearrangement environment. The reward structure comprises of an infeasible action reward, a static object reward, a collision resolution reward, a nearest neighbour reward, and a goal state reaching reward.
In accordance with an embodiment, the trained parameterized deep Q network enables the RL agent to perform action sequence on each target object location preoccupied by at least one of each source object and each target object present in a collided rearrangement environment and a non-collided rearrangement environment.
In accordance with an embodiment, the dynamic intermediate buffer space is predicted at runtime for positioning and swapping the current state of each source object with the corresponding goal state of each target object for avoiding collision occurrence.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates an exemplary block diagram of a system (alternatively referred to as object rearrangement system), in accordance with some embodiments of the present disclosure.
FIG.2A and FIG.2B illustrate a high-level overview of a trained parameterized deep Q network for rearranging objects present in a rearrangement environment using the system of FIG.1, in accordance with some embodiments of the present disclosure.
FIG. 3 depicts a flow diagram illustrating a method to generate a task planning for rearranging objects present in a rearrangement environment by utilizing the parameterized deep Q network using the system of FIG.1, in accordance with some embodiments of the present disclosure.
FIG. 4 illustrates an example visual rearrangement scenario of a drawing room executed by a reinforcement learning (RL) agent to replace a current state of each source object with corresponding goal state of each target object using the system of FIG.1, in accordance with some embodiments of the present disclosure.
FIG.5 illustrates a dense reward structure assigned to the RL agent for every action performed in the visual rearrangement scenario using the system of FIG.1, in accordance with some embodiments of the present disclosure.
FIG.6 illustrates experimental results of an example source plan of the visual rearrangement scenario representing a current state of one or more objects positioned in a three-dimensional (3D) space using the system of FIG.1, in accordance with some embodiments of the present disclosure.
FIG.7 illustrates experimental results of an example target plan of the visual rearrangement scenario representing a goal state of one or more objects to be positioned in the 3D space using the system of FIG.1, in accordance with some embodiments of the present disclosure.
FIG.8 depicts exemplary experimental results illustrating visual rearrangement scenario executed by the trained parameterized deep Q network enabling the RL agent to perform action sequence on each target object location using the system of FIG.1, in accordance with some embodiments of the present disclosure.
FIG.9 shows a comparison of the number of average returns when RL agent is trained with a sparse reward structure and the dense reward structure using the system of FIG.1, in accordance with some embodiments of the present disclosure.
FIG.10 shows a comparison of the number of average returns versus a number of increasing steps performed by the RL agent to replace each object using the system of FIG.1, in accordance with some embodiments of the present disclosure. FIG.10 shows that ε = 0.04 gives the best average return.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
Embodiments herein provide a method and system for planning a large-scale object rearrangement using deep reinforcement learning. The disclosed method generates a task plan to replace the current state of each source object present in a three-dimensional (3D) rearrangement environment with the corresponding goal state of each target object utilizing a trained parameterized deep Q network. Object rearrangement is moving one or more objects from the current state to the goal state through task and motion planning. The method of the present disclosure obtains a source plan and a target plan of the rearrangement environment as input. The source plan comprises of the current state of one or more source objects positioned in the 3D space of the rearrangement environment. The target plan comprises of the goal state of one or more target objects to be rearranged in the 3D space of the rearrangement environment. The method generates a task plan that is feasible in a discrete-continuous action space. The method is performed using a deep RL agent which provides robust and efficient task plan execution for a large-scale object rearrangement environment. Also, the RL agent is assigned a reward structure for every action performed. The performance of the RL agent is measured in terms of minimal moves, which reduces the overall traversal of the RL agent. The disclosed system is further explained with the method as described in conjunction with FIG.1 to FIG.10 below.
Referring now to the drawings, and more particularly to FIG. 1 through FIG.10, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates an exemplary block diagram of a system (alternatively referred to as object rearrangement system), in accordance with some embodiments of the present disclosure. In an embodiment, the object rearrangement system 100 includes processor(s) 104, communication interface(s), alternatively referred to as input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the processor(s) 104. The system 100, with the processor(s), is configured to execute functions of one or more functional blocks of the system 100. Referring to the components of the system 100, in an embodiment, the processor(s) 104 can be one or more hardware processors 104. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) 104 is configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud, and the like.
The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface (s) 106 can include one or more ports for connecting a number of devices (nodes) of the system 100 to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic-random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
FIG.2A and FIG.2B illustrate a high-level overview of a trained parameterized deep Q network for rearranging objects present in a rearrangement environment using the system of FIG.1, in accordance with some embodiments of the present disclosure. FIG.2A shows the object rearrangement system 100 including a parameterized deep Q network 202, a reinforcement learning (RL) agent 204, and a rearrangement environment 206. The object rearrangement system 100 is configured to generate a task plan to replace the current state of each source object with the corresponding goal state of each target object. The RL agent 204 is positioned in the rearrangement environment 206 to perform an action sequence, and a reward is assigned for every action step. The object rearrangement system 100 receives a source plan and a target plan to predict an action sequence to replace the current state of each source object with the corresponding goal state of each target object in the rearrangement environment.
FIG.2B is the parameterized deep Q network 202 comprising of a state space 220, a parameter network P 224, and a Q network 226. The parameterized deep Q network 202 is a combinatorial network comprising of the parameter network P 224 and the Q network 226.
The state space 220 associated with the parameterized deep Q network 202 provides the current state of each source object and a current position of the reinforcement learning (RL) agent 204 for every action being performed to reach the goal state of each target object.
The parameter network P 224 associated with the parameterized deep Q network 202 predicts the continuous coordinates where the selected object is to be placed in the dynamic intermediate buffer space. The Q network 226 associated with the parameterized deep Q network 202 decides the blocked source object to be selected from the current state.
The parameterized deep Q network 202 generates an optimal move sequence for each source object and each target object to transform the current state of each source object into the target state of the corresponding target object. The hybrid action space supports a discrete action space and a continuous action space and utilizes a sparse representation of the rearrangement environment as the state space of the Q network 226. The present disclosure is further explained with an example, where the system 100 generates the task plan executed by the RL agent 204 to rearrange the goal state of each target object position with the current state of each source object using the system of FIG.1 and FIG.2.
Further, the memory 102 comprises information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure. Functions of the components of the system 100, for planning the large-scale object rearrangement, are explained in conjunction with FIG.3 through FIG.10 providing flow diagrams, architectural overviews, and performance analysis of the system 100.
FIG. 3 depicts a flow diagram illustrating a method to generate a task planning for rearranging objects present in a rearrangement environment by utilizing the parameterized deep Q network using the system of FIG.1, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the processor(s) 104 and is configured to store instructions for execution of steps of the method 300 by the processor(s) or one or more hardware processors 104. The steps of the method 300 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG.2 through FIG.10. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps to be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
Referring now to the steps of the method 300, at step 302, the one or more hardware processors 104 receive, by a trained reinforcement learning (RL) agent placed in a rearrangement environment, a source plan and a target plan. The source plan comprises of a current state of one or more source objects positioned in a three-dimensional (3D) space of the rearrangement environment. The target plan comprises of a goal state of one or more target objects to be rearranged in the 3D space of the rearrangement environment. Consider an example (FIG.4) where a drawing room scene is represented as the rearrangement environment 206.
FIG. 4 illustrates an example visual rearrangement environment, for example a drawing room, executed by a reinforcement learning (RL) agent 204 to replace a current state of each source object with the corresponding goal state of each target object using the system of FIG.1, in accordance with some embodiments of the present disclosure. The drawing room has one or more static source objects placed in it, and the RL agent 204 is positioned in the rearrangement environment 206. The system 100 is capable of performing replacement of each target object in collision and non-collision environments. Here, the drawing room scene is a cluttered environment with collisions. The RL agent 204 performs a moving action sequence for each source object to be replaced with each target object to be positioned.
Referring now to the steps of the method 300, at step 304, the one or more hardware processors 104 locate, by the RL agent, the current state of each source object and the goal state of each target object and predict an action sequence to replace the current state of each source object with the corresponding goal state of each target object.
Referring to the above example (FIG.4), with the received source plan and the target plan, the RL agent 204 positioned in the rearrangement environment 206 locates the one or more source objects and the one or more target objects present in the drawing room. The RL agent 204 of the system 100 predicts the action sequence for replacement. This enables the system 100 to devise the task plan for the target plan to be executed for replacement.
In one embodiment, for the given example, consider:
where,
{O_i^s}_(i=1,2,…,M) represents the list of 3D positions of the one or more source objects present in the drawing room, and
{O_i^t}_(i=1,2,…,M) represents the list of 3D positions of the target states, so that the source and target states together specify a rearrangement setup for M objects.
The task is to devise a task plan which generates a set of valid moves to transform each source object from the current source object state {O_i^s}_i to the target object state {O_i^t}_i. The combined action a_j = (m, p_m) moves an object O_m^s from the source state to the target state, where m ∈ [M] is a discrete action specifying which object to move, and p_m ∈ R^3 is the continuous action parameter specifying the target 3D position of the object m. Referring now to the collision instances, the target location of a target object {O_i^t} may be preoccupied by another source object {O_j^s}. In such a scenario there is a need to predict an available free 3D location to place the source object {O_j^s}. The method predicts the 3D target locations instead of copying them from {O_i^t} to support the collision instances without using an explicit buffer location. The method generates a valid sequence of actions {a_j}_(j=1,2,…,N) that advances the transition from the current state of the source objects {O_i^s}_i to the goal state {O_i^t}_i of the target objects. It is noted that the method utilizes an open-source platform for visual artificial intelligence (AI2-THOR) with inbuilt functionality to obtain object identifiers and positional information to compute the current state of each source object {O_i^s}_i and the goal state of each target object {O_i^t}_i from the input source and goal images.
Further, the method of the present disclosure analytically computes the translation matrix t of dimension M × 3 between the 3D positions of each source object and the corresponding target object, and an M-dimensional collision vector C ∈ {0,1}^M denoting whether an object finds its target goal blocked or free.
The matrix t and the vector C decompose the rearrangement state into a more compact representation s_c, and this representation is used in the deep RL to predict each action {a_j}_(j=1,2,…,N), as described in the other embodiments of the present disclosure.
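As an illustration only, the following Python sketch shows one way such a compact state could be assembled from the source and target 3D positions; the function name, the distance threshold used to flag a blocked goal, and the flattening into a single vector are assumptions of this sketch, not details taken from the disclosure.

```python
import numpy as np

def compact_state(source_pos, target_pos, clearance=0.25):
    """Illustrative construction of the compact rearrangement state s_c.

    source_pos, target_pos: (M, 3) arrays of current and goal 3D positions.
    clearance: assumed minimum separation used to flag a blocked goal.
    Returns the M x 3 translation matrix t, the M-dimensional collision
    vector C, and a flattened state vector s_c.
    """
    source_pos = np.asarray(source_pos, dtype=float)
    target_pos = np.asarray(target_pos, dtype=float)
    t = target_pos - source_pos  # per-object translation from current to goal position

    # An object's goal counts as blocked if any *other* object currently sits
    # within `clearance` of that goal position (assumed analytic test).
    dists = np.linalg.norm(target_pos[:, None, :] - source_pos[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # ignore the object's own current position
    C = (dists.min(axis=1) < clearance).astype(float)

    s_c = np.concatenate([t.reshape(-1), C])  # compact representation fed to the networks
    return t, C, s_c
```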
Referring now to the steps of the method 300, at step 306, the one or more hardware processors 104 generate a task plan by utilizing a trained parameterized deep Q network, executed by the RL agent to rearrange the current state of each object to achieve the goal state of each target object position. Further, the task plan constructs a discrete action and a continuous action to reach the goal state of each target object from the current position. The RL agent 204 performs the discrete action by selecting each source object from the current state blocking the goal state of the corresponding target object and placing the selected source object in a dynamic intermediate buffer space. It is to be noted that the method of the present disclosure utilizes available buffer space and does not use explicitly allocated buffer space for rearranging the current state of each source object with the goal state of each target object.
The RL agent 204 performs the continuous action by tracing a destination coordinate from the current position of the RL agent 204 to reach the goal state of each target object.
Further the Q network 226 of the parameterized deep Q network 202 decides the blocked object to be selected from the current state. The parameter network P 224 of the parameterized deep Q network 202 predicts continuous coordinates where the selected source object is placed in the dynamic intermediate buffer space.
The parameterized deep Q network 202 is provided with the state of each source object and each target object in terms of collision combined with the relative translation between the current position and the target position. The RL agent 204 generates the task plan to rearrange the objects in a shorter time. In addition, the method intelligently uses available free spaces in the drawing room (FIG.4) as an intermediate buffer, so that each source object corresponding to a blocked goal is placed in those free spaces while accounting for the long-term effect of the placement. Both the object selection sequence and efficient free-space management for collisions are handled using the RL agent 204. Therefore, the discrete action is responsible for selecting the correct source object and the continuous action is responsible for finding the destination coordinate.
The destination coordinate can be the target location or a free-space coordinate in case of collision. The parameterized deep Q network 202 is responsible for handling both the collided and non-collided objects and the continuous action space for placing the objects. The action a_j is defined as (m, p_m), where m is the index of the object to be selected and p_m is the continuous coordinate (free space or target location) where the selected source object m is to be placed. The action space is hybrid, requiring a discrete action followed by a continuous action parameter. The Q network 226 (θ_Q) generates m and the parameter network P 224 (θ_p) generates p_m for m. At each step t, the method chooses an action to advance to a next state s'_c from s_c and receives a reward r(s'_c, a) according to a Markov decision process (MDP). The objective is to maximize the total expected reward, expressed as the Bellman equation given in Equation 1,
Q(s_c, m, p_m) = \mathbb{E}_{r, s'_c}\left[\, r + \max_{m'} Q\left(s'_c, m', \theta_{p_{m'}}(s'_c)\right) \,\middle|\, s_c, m, p_m \right] ----- Equation 1
Here, each Q value is updated only when its corresponding action (m, p_m) is sampled, and thus has no information on the effect of the other action parameters p_i where i ≠ m. There is a need to incorporate the other parameters p_i that have an impact on m. Now, each Q value of the Q network 226 is considered a function of the joint action parameter vector p = (p_1, ..., p_M), rather than only the action parameter p_m corresponding to the associated selected object m, which can be expressed as the modified Bellman equation given by Equation 2.
Here θ_p predicts the joint action parameters.
Q(s_c, m, p) = \mathbb{E}_{r, s'_c}\left[\, r + \max_{m' \in [M]} Q\left(s'_c, m', \theta_p(s'_c)\right) \,\middle|\, s_c, m, p \right] ----- Equation 2
A least-squares loss function is used for θ_Q as shown in Equation 4, and the loss function for θ_p is given by Equation 3,
L_p(\theta_p) = -\sum_{m=1}^{M} Q\left(s_c, m, \theta_p(s_c); \theta_Q\right) ----------- Equation 3
L_Q(\theta_Q) = \mathbb{E}_{(s_c, m, p, r, s'_c) \sim R_B}\left[\tfrac{1}{2}\left(y - Q(s_c, m, p; \theta_Q)\right)^2\right] ----------- Equation 4
where y = r + \max_{m' \in [M]} Q\left(s'_c, m', \bar{\theta}_p(s'_c); \theta_Q\right) is the update target derived from Equation 2 and R_B is the replay buffer.
Now, the Q values are dependent on all action parameters, which negatively affects the discrete action policy θ_Q and may cause sub-optimal greedy action selection. To overcome this problem, the Q network 226 could be split into separate networks for each discrete action; however, this is computationally intractable. Instead, k forward passes are performed for k actions, as shown in FIG.2, where the joint action parameter vector is modified to make it sparse as p_{e_k}. Here, p_{e_k} is the modified action parameter vector for action parameter p_k, where e_k is the standard basis vector for dimension k.
For example, p_{e_k} = (0, 0, 0, .., p_k, .., 0). This joint action parameter helps to negate the effect of the other, unassociated action parameters in the selection of discrete actions. The Q network 226 produces k × k Q values, as each pass yields k Q values. Among this k × k matrix of Q values, the diagonal Q values Q_i = (Q_11, Q_22, ..., Q_kk), as shown in FIG.2, are important, as they contain the associated Q value for each pass. By taking the argmax of Q_i, the desired discrete action is obtained. The modified joint sparse action parameter vector helps to get the proper action using k forward passes instead of using k different, intractable Q networks.
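A minimal sketch of this selection step is given below, assuming a q_net(state, params) that returns k Q-values for a flattened parameter vector; the helper name and tensor layout are illustrative only, not taken from the disclosure.

```python
import torch

def select_discrete_action(q_net, s_c, p, k):
    """Pick the discrete action m via k forward passes with sparse joint
    action parameters p_{e_k}, keeping only the diagonal Q-values (sketch).

    q_net: callable taking (state, flattened params) and returning k Q-values.
    s_c:   (S,) compact state tensor.
    p:     (k, 3) continuous parameters proposed by the parameter network P.
    """
    q_rows = []
    for i in range(k):
        p_sparse = torch.zeros_like(p)   # p_{e_i}: zero out all but parameter i
        p_sparse[i] = p[i]
        q_rows.append(q_net(s_c, p_sparse.reshape(-1)))
    q_matrix = torch.stack(q_rows)       # (k, k) matrix of Q-values
    q_diag = torch.diagonal(q_matrix)    # Q_11, Q_22, ..., Q_kk
    m = int(torch.argmax(q_diag))        # desired discrete action
    return m, q_diag
```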
In one embodiment, the parameterized deep Q network 202 is trained by performing the steps of, Step 1 – determine, by the Q network 226 of the parameterized deep Q network 202, the one or more discrete actions for a plurality of Q network 226 inputs comprising of a state space with size S and an action parameter with size P. These fully connected layers contain dropout,
where FC(S+P, 256) → ReLU → Dropout(0.5) → FC(256, 64) → ReLU → Dropout(0.25) → FC(64, A).
Step 2 – determine, by the P network of the parameterized deep Q network 202, one or more continuous action parameters for the state space with size S. These fully connected layers contain dropout (a sketch of both networks is given after Step 5 below),
where FC(S, 256) → ReLU → Dropout(0.5) → FC(256, 64) → ReLU → Dropout(0.25) → FC(64, P).
Step 3 – generate, for each action performed by the RL agent 204, a plurality of trajectories comprising of the current state, the discrete action, the reward, and the next state reached from the current state of the RL agent 204, and store the plurality of trajectories in a replay buffer.
Step 4 – create a batch from the replay buffer, update the learning value of the Q network 226 by minimizing the Bellman error, and set a learning rate for the Q network 226 and the parameter network P 224, and
Step 5 – perform Polyak averaging to update the target networks of the Q network 226 and the parameter network P 224, setting the rate of averaging for the Q network 226 and the parameter network P 224.
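A minimal PyTorch sketch of the two networks described in Step 1 and Step 2 follows; the layer sizes and dropout rates mirror the specification above, while the class names and the choice of PyTorch are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """FC(S+P,256) -> ReLU -> Dropout(0.5) -> FC(256,64) -> ReLU ->
    Dropout(0.25) -> FC(64,A): Q-values over the A discrete actions."""
    def __init__(self, state_dim, param_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + param_dim, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 64), nn.ReLU(), nn.Dropout(0.25),
            nn.Linear(64, num_actions))

    def forward(self, state, params):
        return self.net(torch.cat([state, params], dim=-1))

class PNetwork(nn.Module):
    """FC(S,256) -> ReLU -> Dropout(0.5) -> FC(256,64) -> ReLU ->
    Dropout(0.25) -> FC(64,P): continuous action parameters."""
    def __init__(self, state_dim, param_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 64), nn.ReLU(), nn.Dropout(0.25),
            nn.Linear(64, param_dim))

    def forward(self, state):
        return self.net(state)
```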
The parameterized deep Q network 202 trains the model in three steps. In the first step, ε-greedy exploration generates the trajectories (s_c, a, r, s'_c) and stores them into the replay buffer. Here, the size of the replay buffer is 100000. In the second step, a batch (s_c, a, r, s'_c) is sampled from the replay buffer and the Q value is updated by minimizing the Bellman error. The learning rates for θ_Q and θ_p are set as 0.001 and 0.00001 respectively. The Adam optimizer is utilized for both the parameter network P 224 and the Q network 226. In the third step, Polyak averaging is performed to update the target networks of the Q network 226 and the parameter network P 224. The rates of averaging for θ_Q and θ_p are set as 0.005 and 0.0005 respectively.
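A condensed sketch of one training iteration under the hyperparameters stated above (Adam with learning rates 0.001 and 0.00001, averaging rates 0.005 and 0.0005) is shown below; the batching interface and variable names are assumptions of this sketch, and the networks follow the architecture sketch given earlier.

```python
import copy
import torch

def make_trainer(q_net, p_net):
    """Create lagged target networks and Adam optimizers (illustrative sketch)."""
    q_tgt, p_tgt = copy.deepcopy(q_net), copy.deepcopy(p_net)
    opt_q = torch.optim.Adam(q_net.parameters(), lr=1e-3)   # learning rate for theta_Q
    opt_p = torch.optim.Adam(p_net.parameters(), lr=1e-5)   # learning rate for theta_p
    return q_tgt, p_tgt, opt_q, opt_p

def train_step(batch, q_net, p_net, q_tgt, p_tgt, opt_q, opt_p,
               tau_q=0.005, tau_p=0.0005):
    """One update of theta_Q, theta_p and the Polyak-averaged target networks."""
    s, m, p, r, s_next = batch   # (s_c, m, p, r, s'_c) sampled from the replay buffer

    # Q update: minimize the Bellman error of Equation 4.
    with torch.no_grad():
        y = r + q_tgt(s_next, p_tgt(s_next)).max(dim=-1).values  # update target
    q_sel = q_net(s, p).gather(-1, m.unsqueeze(-1)).squeeze(-1)
    loss_q = 0.5 * (y - q_sel).pow(2).mean()
    opt_q.zero_grad(); loss_q.backward(); opt_q.step()

    # P update: maximize the summed Q-values as in Equation 3.
    loss_p = -q_net(s, p_net(s)).sum(dim=-1).mean()
    opt_p.zero_grad(); loss_p.backward(); opt_p.step()

    # Polyak averaging of the target networks (Equations 5 and 6).
    for tgt, src, tau in ((q_tgt, q_net, tau_q), (p_tgt, p_net, tau_p)):
        for t_param, s_param in zip(tgt.parameters(), src.parameters()):
            t_param.data.mul_(1 - tau).add_(tau * s_param.data)
```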
The parameterized deep Q network training dataset consists of 6,000 distinct rearrangement configurations, specifically for room settings, involving 72 different object types in 120 scenes. The dataset is split into training and test sets to evaluate the method of the present disclosure, using the off-policy RL agent 204 for training.
Further, the RL agent receives a reward based on a reward structure (FIG.5) for every action performed in the rearrangement environment. The reward structure comprises of an infeasible action reward, a static object reward, a collision resolution reward, a nearest neighbour reward, and a goal state reaching reward.
Referring now to FIG.5, which illustrates a dense reward structure assigned to the RL agent for every action performed in the visual rearrangement scenario using the system of FIG.1, in accordance with some embodiments of the present disclosure.
In one embodiment, the RL agent 204 is trained to receive a hierarchical dense reward as detailed in FIG.5. A sparse reward in a long-term sequential decision-making problem makes the RL agent 204 poor in sample efficiency. The dense hierarchical reward is therefore considered to make the RL agent 204 sample efficient by allowing the agent to perform informative and effective exploration to reach a goal quickly in each episode at the time of training. The hierarchical reward gives meaningful feedback to the RL agent 204, which eliminates unnecessary traversal towards unintended places. The hierarchical rewards are used to efficiently select and place the correct source object and target object along with handling of the blocked goals or collisions.
The reward structure comprises of an infeasible action reward, a static object reward, a collision resolution reward, a nearest neighbour reward, and a goal state reaching reward.
The infeasible action reward (R1) deals with the scenarios where the RL agent 204 has produced an action that cannot be realized in the rearrangement environment. Such an action, if produced by the method of the present disclosure, is penalized heavily. Referring to FIG.5, the reward block R1 depicts this reward structure. The RL agent 204 is assigned the infeasible action reward when an absurd action is performed in the rearrangement environment.
The static object reward (R2) is another key aspect for any rearrangement to avoid redundant moves. The RL agent 204 is assigned the static object reward when a static object misplacement is performed in the rearrangement environment, which decreases the number of redundant actions. Referring to FIG.5, the reward block R2 defines the static object reward. The method prevents selection of the static source objects, which are not movable, to decrease the number of redundant moves.
The collision resolution reward (R3) gives priority to the goal-occupying objects (which occupy the target location of other objects) to move first, instead of the goal-blocked objects. This helps to free the spaces for the goal-blocked objects before their moves, to resolve the collision. If the goal-occupying object's target location is itself occupied, then this reward ensures that the free location selected for the goal-occupying object is nearest to that object's target location instead of a random location. This helps to reduce the traversal of the RL agent 204.
The nearest neighbour reward (R4) ensures that the traversal of the RL agent 204 is minimal by arranging the nearest objects from the previously placed object first rather than arranging the objects randomly.
The goal reaching reward (R5), depicted in the R5 block of FIG.5, ensures that the method generates the action for each source object whose goal position is free with a single move. It penalizes with the negative residual Euclidean distance if the agent fails to place the object at its target location. The RL agent 204 is assigned the goal state reaching reward when minimal steps are performed to rearrange each target object to its corresponding target location.
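To make the structure concrete, a schematic reward function combining R1 to R5 is sketched below; the numeric magnitudes and the boolean predicates passed in are purely illustrative assumptions, since the disclosure specifies the hierarchy in FIG.5 rather than particular values.

```python
import numpy as np

def hierarchical_reward(new_pos, goal_pos, infeasible, static_selected,
                        resolves_collision, nearest_neighbour, reached_goal):
    """Illustrative dense reward combining R1-R5; all magnitudes are assumed."""
    if infeasible:                 # R1: the action could not be realized -> heavy penalty
        return -10.0
    if static_selected:            # R2: a static (non-movable) object was selected
        return -5.0
    reward = 0.0
    if resolves_collision:         # R3: a goal-occupying object was moved first,
        reward += 2.0              #     to a free spot near its own target
    if nearest_neighbour:          # R4: the object nearest the previously placed one
        reward += 1.0              #     was arranged next
    if reached_goal:               # R5: the object reached its free goal in a single move
        reward += 5.0
    else:                          # otherwise penalize the residual Euclidean distance
        reward -= float(np.linalg.norm(np.asarray(new_pos) - np.asarray(goal_pos)))
    return reward
```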
In one embodiment, the objective is not just to maximize the reward but to generalize to previously unseen rearrangement configurations. This requires a diverse set of rearrangement configurations during training. An off-policy method with a replay buffer supports this kind of diversity. One of the challenges of the off-policy setting is to strike a balance between exploration and exploitation; the ε-greedy method handles this balancing (an ε-greedy selection sketch follows Equation 6 below). To stabilize training, two target networks θ̄_Q and θ̄_p, corresponding to θ_Q and θ_p respectively, are used to produce the target Q value y. These target networks maintain lagged versions of the weights of θ_Q and θ_p respectively. Polyak averaging updates these target networks as shown in Equation 5 and Equation 6,
\bar{\theta}_Q = \tau_1 \cdot \theta_Q + (1 - \tau_1) \cdot \bar{\theta}_Q ---------- Equation 5
\bar{\theta}_p = \tau_2 \cdot \theta_p + (1 - \tau_2) \cdot \bar{\theta}_p ---------- Equation 6
Here, τ_1 and τ_2 are the rates of averaging.
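An ε-greedy selection sketch for the hybrid action (m, p_m) is given below; it reuses the select_discrete_action helper sketched earlier, assumes the P network outputs three coordinates per object, and the workspace bounds used for random placements are an assumption of this sketch.

```python
import random
import torch

def epsilon_greedy_action(q_net, p_net, s_c, k, epsilon=0.04, workspace=None):
    """Sample a hybrid action (m, p_m) with epsilon-greedy exploration (sketch)."""
    if workspace is None:
        # Assumed axis-aligned bounds of the placement area, one (low, high) pair per axis.
        workspace = torch.tensor([[-1.0, 1.0], [0.0, 1.0], [-1.0, 1.0]])
    if random.random() < epsilon:
        m = random.randrange(k)                       # random object index
        low, high = workspace[:, 0], workspace[:, 1]
        p_m = low + (high - low) * torch.rand(3)      # random 3D placement
        return m, p_m
    with torch.no_grad():
        p = p_net(s_c).reshape(k, 3)                  # parameters for all k objects (P = 3k)
        m, _ = select_discrete_action(q_net, s_c, p, k)
        return m, p[m]
```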
FIG.6 illustrates experimental results of an example source plan of the visual rearrangement scenario representing a current state of one or more objects positioned in a three-dimensional (3D) space using the system of FIG.1, in accordance with some embodiments of the present disclosure.
FIG.7 illustrates experimental results of an example target plan of the visual rearrangement scenario representing a goal state of one or more objects to be positioned in the 3D space using the system of FIG.1, in accordance with some embodiments of the present disclosure. FIG.7 shows the target plan marking the one or more target objects to be replaced in the rearrangement environment 206. The method of the present disclosure is capable of performing rearrangements in different rooms, for example a kitchen or a drawing room, which shows that the method generalizes well across rooms irrespective of their size. Table 1 represents a comparison of room-level rearrangements against existing models with their success rates.
Table 1 – Comparison of room level rearrangements
Method | Dataset | % Success rate | % Fixed strict | % Energy remaining | Changed
Existing method | RoomR | 83.4 | 91.2 | 0.09 | 2.3
Parameterized deep Q network of the present disclosure | RoomR | 97.01 | 98.32 | 0.033 | 2.98
Success Rate: Equals 1 if all object poses are in their goal states after performing the rearrangement task.
Fixed Strict: [1 - {(Remaining number of objects to be placed by agent at the end of rearrangement)/ (Total number of objects required to be placed in a rearrangement setup)}]. Equals 0 if the agent has moved an object that should not have been moved.
Energy Remaining: To allow the partial credit, define an energy function D that monotonically decreases to 0 as two poses get close together. This metric is then defined as the amount of energy remaining at the end of the rearrangement task divided by the total energy at the start of the rearrangement task.
Changed: This metric is simply the number of objects whose pose has been changed by the RL agent 204 during the rearrangement task.
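A sketch of these four metrics is given below; the pose tolerance and the particular energy function D (a bounded, monotone function of the pose distance that is 0 when a pose matches its goal) are assumptions of this sketch, since the disclosure only constrains D qualitatively.

```python
import numpy as np

def rearrangement_metrics(start_pos, end_pos, goal_pos, moved_forbidden, changed,
                          tol=0.05):
    """Success Rate, Fixed Strict, Energy Remaining, and Changed (illustrative).

    start_pos, end_pos, goal_pos: (M, 3) object positions at the start, at the
    end, and in the goal state.  moved_forbidden: whether an object that should
    not have been moved was moved.  changed: number of objects whose pose the
    agent changed.  tol: assumed tolerance for a pose to count as in place.
    """
    start_pos, end_pos, goal_pos = map(np.asarray, (start_pos, end_pos, goal_pos))

    def energy(pos):  # assumed D: 0 when a pose coincides with its goal, grows with distance
        return 1.0 - np.exp(-np.linalg.norm(pos - goal_pos, axis=-1))

    misplaced_end = np.linalg.norm(end_pos - goal_pos, axis=-1) > tol
    misplaced_start = np.linalg.norm(start_pos - goal_pos, axis=-1) > tol

    success = float(not misplaced_end.any())
    fixed_strict = 0.0 if moved_forbidden else \
        1.0 - misplaced_end.sum() / max(misplaced_start.sum(), 1)
    energy_remaining = energy(end_pos).sum() / max(energy(start_pos).sum(), 1e-9)
    return {"success_rate": success, "fixed_strict": fixed_strict,
            "energy_remaining": energy_remaining, "changed": changed}
```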
In one embodiment, Table 2 shows that the method of the present disclosure scales well with a higher degree of complexity where the existing method fails, as the method does not rely on the simulation step used in the existing method to resolve the complex search space as the complexity of the configuration increases.
Table 2 – Comparison of scalability and computational time of the rearrangement task
Number of objects | Number of swap pairs | Number of non-swap | Number of moves (existing method) | Number of moves (parameterized deep Q network) | Task plan time (ms) (existing method) | Task plan time (ms) (parameterized deep Q network)
10 | 2 | 6 | 12 | 12 | 12.06 | 10.8
10 | 3 | 4 | 13 | 13 | 10.02 | 11.7
10 | 4 | 2 | 14 | 14 | 11.53 | 12.6
20 | 6 | 8 | 26 | 26 | 29.52 | 23.4
20 | 7 | 6 | 27 | 27 | 31.06 | 24.3
20 | 9 | 2 | 29 | 29 | 32.78 | 26.1
30 | 4 | 22 | 34 | 34 | 53.5 | 30.6
30 | 10 | 10 | 40 | 40 | 91.56 | 36.2
30 | 12 | 6 | NSF | 45 | NSF | 40.5
40 | 12 | 16 | NSF | 58 | NSF | 52.3
40 | 16 | 8 | NSF | 63 | NSF | 56.7
40 | 18 | 4 | NSF | 69 | NSF | 62.1
50 | 20 | 2 | NSF | 76 | NSF | 68.4
50 | 22 | 6 | NSF | 84 | NSF | 75.6
50 | 24 | 4 | NSF | 92 | NSF | 82.8
Table 2 also shows that the method of the present disclosure works faster for complex setups and scales linearly with an increasing number of objects. Also, the method is trained on room-level scenes without having any implicit notion of a tabletop.
FIG.8 depicts exemplary experimental results illustrating a visual rearrangement scenario executed by the trained parameterized deep Q network enabling the RL agent to perform an action sequence on each target object location using the system of FIG.1, in accordance with some embodiments of the present disclosure. The drawing room example rearrangement scenario is used for understanding the importance of the hierarchical reward structure:
Step1, Apple is selected from the black table.
Step2, Apple is placed in the white table (buffer location).
Step3, Cup is selected from the white table and placed at the Apple’s initial location on the black table.
Step4, Kettle is selected from the black table and placed at the white table (buffer location).
Step5, Apple is selected from the white table and placed at the Cup’s initial location on the white table.
Step6, Bowl is selected from the white table and placed at Kettle’s initial location on the black table.
Step7, Salt is selected from the black table and placed at the basin.
Step8, Bread is selected from the basin and placed at Salt’s initial position on the black table.
Step9, Kettle is selected from the white table and placed at bowl’s initial location on white table.
Step2, Step3 and Step5 show the swap between apple and cup.
Step4, Step6 and Step9 show the swap between kettle and bowl.
In another embodiment, a detailed ablation study was also performed by the system 100 of the present disclosure to understand the importance of each individual reward in the method of the present disclosure. The trained method was evaluated with a series of models in a large room space by eliminating one or more rewards. Table 3 shows that all the individual rewards work together to produce the best result. It also shows that R4 and R5 together improve the planning, as the planner takes a lesser number of object movements and reduces the agent's overall traversal to complete the rearrangement task. It is observed from row 1 and row 2 of Table 3 that the number of moves is the same in both cases, but the distance traversed by the agent is lower in the case of row 1 due to the R4 reward.
Table 3 – Comparison of hierarchical dense reward structure
Rewards used | Number of movable objects | Number of swap | Number of non-swap | Number of moves | Euclidean distance traversal (ms)
R1+R2+R3+R4+R5 | 10 | 2 | 6 | 12 | 18.323
R1+R2+R3+R4+R5 | 10 | 3 | 4 | 13 | 19.641
R1+R2+R3+R4+R5 | 10 | 4 | 2 | 14 | 22.456
R1+R2+R3+R5 | 10 | 2 | 6 | 12 | 26.131
R1+R2+R3+R5 | 10 | 3 | 4 | 13 | 28.739
R1+R2+R3+R5 | 10 | 4 | 2 | 14 | 31.235
R1+R2+R3 | 10 | 2 | 6 | 17 | 36.212
R1+R2+R3 | 10 | 3 | 4 | 20 | 42.923
R1+R2+R3 | 10 | 4 | 2 | 24 | 52.537
R1+R2+R4+R5 | 10 | 2 | 6 | 23 | 50.233
R1+R2+R4+R5 | 10 | 3 | 4 | 34 | 72.765
R1+R2+R4+R5 | 10 | 4 | 2 | 42 | 88.864
R1+R3+R4+R5 | 10 | 2 | 6 | 49 | 98.345
R1+R3+R4+R5 | 10 | 3 | 4 | 53 | 107.329
R1+R3+R4+R5 | 10 | 4 | 2 | 62 | 125.981
R2+R3+R4+R5 | 10 | 2 | 6 | NSF | NSF
R2+R3+R4+R5 | 10 | 3 | 4 | NSF | NSF
R2+R3+R4+R5 | 10 | 4 | 2 | NSF | NSF
In FIG.8, the target location of the Apple is blocked by the Cup and the target location of the Cup is blocked by the Apple, so in Step2, instead of placing the Apple at any predefined free space in the room, the RL agent 204 automatically chooses a free space which is near to its target location. In Step3, with the help of the R4 reward, the agent selects the Cup, which is the nearest object to the previously placed object Apple, instead of choosing other objects which are placed far from the Apple. This shows that the rewards R3 and R4 are capable of achieving minimal traversal of the RL agent in the rearrangement scenario. To understand the importance of the dense reward for long-horizon tasks, the parameterized deep Q network 202 was also trained with a sparse reward: -50 if the episode fails and 50 if it succeeds.
FIG.9 shows a comparison of the number of average returns when the RL agent is trained with a sparse reward structure and the dense reward structure using the system of FIG.1, in accordance with some embodiments of the present disclosure. FIG.9 shows that the average return under the sparse reward does not improve with increasing steps, while the dense reward gives a good average return with increasing steps. This proves that the dense reward provides effective feedback to the method of the present disclosure to generate the feasible task plan. An extensive experiment was performed to find a suitable value of ε to balance the exploration and exploitation in the off-policy setting. For example, ε is sampled in the range of [0, 1] and experiments on the room-level environment compute the average return (R1 + R2 + R3 + R4 + R5) for each episode.
FIG.10 shows a comparison of the number of average returns versus a number of increasing steps performed by the RL agent to replace each object using the system of FIG.1, in accordance with some embodiments of the present disclosure. FIG.10 shows that ε = 0.04 gives the best average return.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
The embodiments of the present disclosure herein address the unresolved problem of object rearrangement. The embodiments thus provide a method and system for planning a large-scale object rearrangement using deep reinforcement learning. Moreover, the embodiments herein further provide a deep RL agent executing a robust and efficient task plan for large-scale rearrangement. The method of the present disclosure is capable of replacing an increasing number of target objects with complex rearrangement setups such as multiple swap instances. The dense reward structure helps to generate the task plan with minimum moves and reduces the overall traversal of the RL agent. The method is demonstrated across different rearrangement scenarios, from 2D surfaces such as tabletops to 3D rooms with a large number of objects, without any explicit need for buffer space.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Claims:
We Claim:
1. A processor implemented method (300) for planning a large-scale object rearrangement using deep reinforcement learning, comprising:
receiving (302) by a trained reinforcement learning (RL) agent placed in a rearrangement environment, via one or more hardware processors, a source plan and a target plan, wherein the source plan comprises a current state of one or more source objects positioned in a three-dimensional (3D) space of the rearrangement environment, and wherein the target plan comprises a goal state of one or more target objects to be rearranged in the 3D space of the rearrangement environment;
locating (304) by the RL agent, via the one or more hardware processors, the current state of each source object and the goal state of each target object, and predicting an action sequence to replace the current state of each source object with the corresponding goal state of each target object; and
generating (306) a task plan by utilizing a trained parameterized deep Q network, via the one or more hardware processors, executed by the RL agent to rearrange the current state of each source object to achieve the goal state of each target object position, comprising:
constructing a discrete action and a continuous action to reach the goal state of each target object from the current position of the RL agent, wherein the discrete action selects each source object from the current state blocking the goal state of the corresponding target object and places the selected object in a dynamic intermediate buffer space, wherein the continuous action traces a destination coordinate from the current position of the RL agent to reach the goal state of each target object, and wherein the Q network of the parameterized deep Q network decides the blocked object to be selected from the current state and the parameter network P of the parameterized deep Q network predicts a continuous coordinate where the selected source object is to be placed in the dynamic intermediate buffer space; and
assigning a reward based on a reward structure for every action performed by the RL agent in the rearrangement environment, wherein the reward structure comprises an infeasible action reward, a static object reward, a collision resolution reward, a nearest neighbour reward, and a goal state reaching reward.
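By way of illustration only, the discrete-plus-continuous action construction recited in claim 1 above can be sketched in Python as follows. This is a minimal sketch assuming a Q network that scores discrete object-selection actions and a parameter network P that outputs continuous buffer coordinates; the class and function names (PNetwork, QNetwork, select_action), layer sizes, and exploration scheme are hypothetical and are not taken from the specification.

import torch
import torch.nn as nn

class PNetwork(nn.Module):
    # Hypothetical parameter network: maps a state encoding to one candidate
    # buffer coordinate (continuous action parameter) per object.
    def __init__(self, state_dim, n_objects, param_dim=2):
        super().__init__()
        self.n_objects, self.param_dim = n_objects, param_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_objects * param_dim), nn.Tanh())

    def forward(self, state):
        return self.net(state).view(-1, self.n_objects, self.param_dim)

class QNetwork(nn.Module):
    # Hypothetical Q network: scores each discrete object-selection action,
    # conditioned on the state and all continuous parameters.
    def __init__(self, state_dim, n_objects, param_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_objects * param_dim, 128), nn.ReLU(),
            nn.Linear(128, n_objects))

    def forward(self, state, params):
        return self.net(torch.cat([state, params.flatten(1)], dim=-1))

def select_action(state, q_net, p_net, epsilon=0.1):
    # Construct the joint action: which blocking object to move (discrete)
    # and where in the intermediate buffer space to place it (continuous).
    with torch.no_grad():
        params = p_net(state)
        q_values = q_net(state, params)
        if torch.rand(1).item() < epsilon:
            obj = int(torch.randint(q_values.shape[-1], (1,)).item())
        else:
            obj = int(q_values.argmax(dim=-1).item())
    return obj, params[0, obj]

In such a sketch, select_action would be called once per step with the current state encoding; the returned object index plays the role of the discrete action and the returned coordinate the continuous placement in the dynamic intermediate buffer space.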
2. The processor implemented method as claimed in claim 1, wherein the trained parameterized deep Q network enables the RL agent to perform an action sequence on each target object location preoccupied by at least one of each source object and each target object present in a collided rearrangement environment and a non-collided rearrangement environment.
3. The processor implemented method as claimed in claim 1, wherein the dynamic intermediate buffer space is predicted at runtime for positioning and swapping the current state of each source object with the corresponding goal state of each target object for avoiding collision occurrence.
4. The processor implemented method as claimed in claim 1, wherein the parameterized deep Q network is trained by performing the steps of:
determining, by the Q network of the parameterized deep Q network, one or more discrete actions for a plurality of Q network inputs comprising a state space of size S and an action parameter of size P;
determining, by the P network of the parameterized deep Q network, one or more continuous action parameters for the state space of size S;
generating, for each action performed by the RL agent, a plurality of trajectories comprising a next state from the current state of the RL agent, the current state, the discrete action state, and the reward, and storing the plurality of trajectories in a replay buffer;
creating a batch from the replay buffer and updating a learning value of the Q network by minimizing a Bellman error, and setting a learning rate for the Q network and the P network; and
performing a Polyak averaging to update the Q network and the P network and setting the learning rate with the average of the Q network and the P network.
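A minimal training-step sketch corresponding to the steps recited in claim 4 above is given below. It assumes the replay buffer stores (state, discrete action, continuous parameters, reward, next state, done) transitions as tensors; the hyperparameters (batch size, discount factor, Polyak coefficient) and all names are illustrative assumptions, not values disclosed in the specification.

import random, collections
import torch
import torch.nn.functional as F

Transition = collections.namedtuple(
    "Transition", ["state", "discrete_action", "params", "reward", "next_state", "done"])

class ReplayBuffer:
    # Hypothetical replay buffer storing the generated trajectories.
    def __init__(self, capacity=100000):
        self.buffer = collections.deque(maxlen=capacity)
    def push(self, *fields):
        self.buffer.append(Transition(*fields))
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return Transition(*map(torch.stack, zip(*batch)))

def polyak_update(net, target_net, tau=0.005):
    # Soft target update: target <- (1 - tau) * target + tau * online.
    for p, p_t in zip(net.parameters(), target_net.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)

def train_step(q_net, p_net, q_target, p_target, buffer, q_opt, p_opt,
               batch_size=64, gamma=0.99):
    batch = buffer.sample(batch_size)

    # Bellman target: r + gamma * max_a Q_target(s', P_target(s')).
    with torch.no_grad():
        next_q = q_target(batch.next_state, p_target(batch.next_state)).max(dim=-1).values
        target = batch.reward + gamma * (1.0 - batch.done.float()) * next_q

    # Q network update: minimize the Bellman error over the sampled batch.
    q_pred = q_net(batch.state, batch.params).gather(
        1, batch.discrete_action.view(-1, 1)).squeeze(1)
    q_loss = F.mse_loss(q_pred, target)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # P network update: move the continuous parameters toward higher Q values.
    p_loss = -q_net(batch.state, p_net(batch.state)).sum(dim=-1).mean()
    p_opt.zero_grad(); p_loss.backward(); p_opt.step()

    # Polyak averaging of the target networks.
    polyak_update(q_net, q_target)
    polyak_update(p_net, p_target)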
5. The processor implemented method as claimed in claim 1, wherein the RL agent initiates a first action in the rearrangement environment by randomly selecting each source object to be replaced with the corresponding goal state of each target object.
6. The processor implemented method as claimed in claim 1, wherein the RL agent is assigned the infeasible action reward when an absurd action is performed in the rearrangement environment.
7. The processor implemented method as claimed in claim 1, wherein the RL agent is assigned the static object reward when a static object misplacement is performed in the rearrangement environment, which decreases the number of redundant actions.
8. The processor implemented method as claimed in claim 1, wherein the RL agent is assigned the collision resolution reward for moving an object occupying the goal state of a target object to a nearest empty location, which reduces traversal.
9. The processor implemented method as claimed in claim 1, wherein the RL agent is assigned the nearest neighbour reward for minimal traversal performed to rearrange the current state of each source object with the corresponding goal state of each target object.
10. The processor implemented method as claimed in claim 1, wherein the RL agent is assigned the goal state reaching reward when minimal steps are performed to rearrange each target object to the corresponding target location.
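The five-term reward structure recited in claims 1 and 6 to 10 above could, for illustration, be composed as a simple shaping function. The sketch below is an assumption-laden example only: the numeric magnitudes and the fields of the hypothetical StepOutcome record are chosen for the example and are not values disclosed in the specification.

from dataclasses import dataclass

@dataclass
class StepOutcome:
    infeasible: bool            # an absurd / invalid action was attempted (hypothetical flag)
    moved_static_object: bool   # a correctly placed (static) object was disturbed
    resolved_collision: bool    # a goal-occupying object was moved to a nearest empty location
    traversal_distance: float   # distance actually travelled for the move
    nearest_distance: float     # distance to the nearest feasible placement
    reached_goal: bool          # the target object reached its goal state

def reward(outcome: StepOutcome) -> float:
    r = 0.0
    if outcome.infeasible:
        r -= 1.0      # infeasible action reward (penalty for absurd actions); magnitude assumed
    if outcome.moved_static_object:
        r -= 0.5      # static object reward (discourages redundant moves); magnitude assumed
    if outcome.resolved_collision:
        r += 0.5      # collision resolution reward; magnitude assumed
    # Nearest neighbour reward: penalize traversal beyond the nearest option.
    r -= 0.1 * max(0.0, outcome.traversal_distance - outcome.nearest_distance)
    if outcome.reached_goal:
        r += 1.0      # goal state reaching reward; magnitude assumed
    return r

# Example: resolving a collision with near-minimal traversal.
print(reward(StepOutcome(False, False, True, 1.2, 1.0, False)))  # approx. 0.48

In this example, a collision-resolving move that travels slightly farther than the nearest empty location earns the collision resolution reward minus a small nearest-neighbour penalty.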
11. A system (100) for planning a large-scale object rearrangement using deep reinforcement learning, comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive by a trained reinforcement learning (RL) agent placed in a rearrangement environment, a source plan and a target plan, wherein the source plan comprises a current state of one or more source objects positioned in a three-dimensional (3D) space of the rearrangement environment, and wherein the target plan comprises a goal state of one or more target objects to be rearranged in the 3D space of the rearrangement environment;
locate by the RL agent, the current state of each source object and the goal state of each target object, and predict an action sequence to replace the current state of each source object with the corresponding goal state of each target object; and
generate a task plan by utilizing a trained parameterized deep Q network, executed by the RL agent to rearrange the current state of each source object to achieve the goal state of each target object position, comprising:
construct a discrete action and a continuous action to reach the goal state of each target object from the current position of the RL agent, wherein the discrete action selects each source object from the current state blocking the goal state of the corresponding target object and places the selected object in a dynamic intermediate buffer space, wherein the continuous action traces a destination coordinate from the current position of the RL agent to reach the goal state of each target object, and wherein the Q network of the parameterized deep Q network decides the blocked object to be selected from the current state and the parameter network P of the parameterized deep Q network predicts a continuous coordinate where the selected source object is to be placed in the dynamic intermediate buffer space; and
assign a reward based on a reward structure for every action performed by the RL agent in the rearrangement environment, wherein the reward structure comprises an infeasible action reward, a static object reward, a collision resolution reward, a nearest neighbour reward, and a goal state reaching reward.
12. The system as claimed in claim 11, wherein the RL agent initiates a first action in the rearrangement environment by randomly selecting each source object to be replaced with the corresponding goal state of each target object.
13. The system as claimed in claim 11, wherein the parameterized deep Q network is trained by performing the steps of:
determining, by the Q network of the parameterized deep Q network, one or more discrete actions for a plurality of Q network inputs comprising a state space of size S and an action parameter of size P;
determining, by the P network of the parameterized deep Q network, one or more continuous action parameters for the state space of size S;
generating, for each action performed by the RL agent, a plurality of trajectories comprising a next state from the current state of the RL agent, the current state, the discrete action state, and the reward, and storing the plurality of trajectories in a replay buffer;
creating a batch from the replay buffer and updating a learning value of the Q network by minimizing a Bellman error, and setting a learning rate for the Q network and the P network; and
performing a Polyak averaging to update the Q network and the P network and setting the learning rate with the average of the Q network and the P network.
14. The system as claimed in claim 11, wherein the trained parameterized deep Q network enables the RL agent to perform an action sequence on each target object location preoccupied by at least one of each source object and each target object present in a collided rearrangement environment and a non-collided rearrangement environment.
15. The system as claimed in claim 11, wherein the dynamic intermediate buffer space is predicted at runtime for positioning and swapping the current state of each source object with the corresponding goal state of each target object for avoiding collision occurrence.
16. The system as claimed in claim 11, wherein the RL agent initiates a first action in the rearrangement environment by randomly selecting each source object to be replaced with the corresponding goal state of each target object.
17. The system as claimed in claim 11, wherein the RL agent is assigned the infeasible action reward when an absurd action is performed in the rearrangement environment.
18. The system as claimed in claim 11, wherein the RL agent is assigned the static object reward when a static object misplacement is performed in the rearrangement environment, which decreases the number of redundant actions.
19. The system as claimed in claim 11, wherein the RL agent is assigned the collision resolution reward for moving an object occupying the goal state of a target object to a nearest empty location, which reduces traversal.
20. The system as claimed in claim 11, wherein the RL agent is assigned the nearest neighbour reward for minimal traversal performed to rearrange the current state of each source object with the corresponding goal state of each target object.

Dated this 19th Day of July 2022
Tata Consultancy Services Limited
By their Agent & Attorney

(Adheesh Nargolkar)
of Khaitan & Co
Reg No IN-PA-1086

Documents

Application Documents

# Name Date
1 202221041325-STATEMENT OF UNDERTAKING (FORM 3) [19-07-2022(online)].pdf 2022-07-19
2 202221041325-REQUEST FOR EXAMINATION (FORM-18) [19-07-2022(online)].pdf 2022-07-19
3 202221041325-FORM 18 [19-07-2022(online)].pdf 2022-07-19
4 202221041325-FORM 1 [19-07-2022(online)].pdf 2022-07-19
5 202221041325-FIGURE OF ABSTRACT [19-07-2022(online)].jpg 2022-07-19
6 202221041325-DRAWINGS [19-07-2022(online)].pdf 2022-07-19
7 202221041325-DECLARATION OF INVENTORSHIP (FORM 5) [19-07-2022(online)].pdf 2022-07-19
8 202221041325-COMPLETE SPECIFICATION [19-07-2022(online)].pdf 2022-07-19
9 202221041325-FORM-26 [20-09-2022(online)].pdf 2022-09-20
10 Abstract1.jpg 2022-09-24
11 202221041325-Proof of Right [11-01-2023(online)].pdf 2023-01-11
12 202221041325-FER.pdf 2025-05-06
13 202221041325-FER_SER_REPLY [14-10-2025(online)].pdf 2025-10-14
14 202221041325-DRAWING [14-10-2025(online)].pdf 2025-10-14
15 202221041325-CLAIMS [14-10-2025(online)].pdf 2025-10-14

Search Strategy

1 Search_Strategy_MatrixE_11-09-2024.pdf