
Method And System For Simultaneous Learning And Planning

Abstract: This disclosure relates to path planning of an agent using learning and planning techniques. The disclosure provides simultaneous learning and planning in a partially known/uncertain environment for an agent, without a separate requirement of offline training. The disclosed technique learns a policy (sequence of actions) using Reinforcement Learning (RL) while simultaneously using RRT for planning; hence, once a policy (sequence of steps to reach a goal) is learned for a fixed start and goal, repeated planning for the identical start and goal is avoided. The disclosed framework exploits the Voronoi bias property of the Rapidly exploring Random Tree (RRT), which balances exploration-exploitation in RL. In scenarios of environmental uncertainties (obstacle detection or removal of a previously detected obstacle), the RL dynamically adapts the learned policy with the help of RRT. Hence, both learning and planning are performed simultaneously to complement each other and handle environmental uncertainties dynamically, in real-time and online. [To be published with FIG. 2]


Patent Information

Application #
Filing Date
14 June 2021
Publication Number
50/2022
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
kcopatents@khaitanco.com
Parent Application
Patent Number
Legal Status
Grant Date
2024-07-29
Renewal Date

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point Mumbai Maharashtra India 400021

Inventors

1. SADHU, Arup Kumar
Tata Consultancy Services Limited Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata West Bengal India 700160
2. SHUKLA, Shubham
Tata Consultancy Services Limited Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata West Bengal India 700160
3. SORTEE, Sarvesh
Tata Consultancy Services Limited Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata West Bengal India 700160
4. LUDHIYANI, Mohit
Tata Consultancy Services Limited Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata West Bengal India 700160
5. DASGUPTA, Ranjan
Tata Consultancy Services Limited Block -1B, Eco Space, Plot No. IIF/12 (Old No. AA-II/BLK 3. I.T) Street 59 M. WIDE (R.O.W.) Road, New Town, Rajarhat, P.S. Rajarhat, Dist - N. 24 Parganas, Kolkata West Bengal India 700160

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR SIMULTANEOUS LEARNING AND
PLANNING
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD
[001] The disclosure herein generally relates to the field of path planning by robots based on learning techniques, and, more particularly, to a method and a system for simultaneous learning and planning techniques for path planning of robots.
BACKGROUND
[002] In the research on autonomous mobile robots/autonomous vehicles, path planning is a core and extensively studied area. Path planning/motion planning for autonomous vehicles/agents is the process of finding a contiguous path to travel from a source to a destination in an uncertain environment, wherein the environment comprises uncertainties that include dynamic objects as well as objects that can be moved by an external force.
[003] Traditional path planning methods include graph search algorithms such as Dijkstra’s algorithm, free space based path planning, Probabilistic Roadmaps (PRM), or sampling techniques such as Rapidly-Exploring Random Trees (RRT). However, the traditional path planning methods may not be suitable for a high dimensional configuration space, may be unable to provide a deterministic path in real-time with a low storage requirement, or may require an accurate map of the environment. Obtaining an accurate map is challenging amidst the uncertainties in the environment.
[004] Further, a few state-of-the-art techniques for path planning utilize learning based sampling techniques, wherein learning refers to the process of acquiring new knowledge through experience aiming to solve a task optimally. However, learning-based planning techniques need to learn offline (offline training or learning) before online planning, wherein during the offline training phase, the agent cannot plan its actions/path.
SUMMARY

[005] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method and a system for simultaneous learning and planning is provided.
[006] In another aspect, a method for simultaneous learning and planning is provided. The method includes receiving a plurality of inputs associated with an agent in an uncertain environment, wherein the plurality of inputs comprises a floor plan of the uncertain environment, a start position, a goal position, an agent dynamics, and a plurality of sampling parameters. Further, the method comprises identifying a policy for path planning by the agent from a policy learning base for the start position and the goal position, where the policy is a Q-tree comprising a path between the start position and the goal position, wherein the path comprises a plurality of nodes represented by a plurality of Q-values, and performing, based on the identification of the policy, one of the following: on not identifying the policy, planning the path for the agent using a path planning-policy learning technique, which further comprises planning the path between the start position and the goal position based on a sampling path planning technique using the plurality of sampling parameters, and simultaneously learning a policy for the path planned based on a reinforcement learning technique and saving the policy in the learning base; or, on identifying the policy, path planning by the agent by determining a sequence of optimal actions using the policy, wherein an optimal action from the sequence of optimal actions is identified by iteratively evaluating the nodes of the policy, subject to a detection of obstacles, by performing, based on the detection of obstacles, one of the following: upon detecting an obstacle between a current state of the agent and a next state as per the path (policy), performing the planning and policy learning technique for path planning by the agent; or, upon identifying a removal of a previously detected obstacle between the current state of the agent and a next state as per the path (policy), performing a policy updating technique; or, upon not detecting an obstacle between the current state of the agent and a next state as per the path (policy), continuing to iteratively evaluate the nodes of the policy till reaching the goal position based on the policy.
[007] The system includes a memory storing instructions, one or more communication interfaces, and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to receive a plurality of inputs associated with an agent in an uncertain environment, wherein the plurality of inputs comprises a floor plan of the uncertain environment, a start position, a goal position, an agent dynamics, and a plurality of sampling parameters. Further, the system is configured for identifying a policy for path planning by the agent from a policy learning base for the start position and the goal position, where the policy is a Q-tree comprising a path between the start position and the goal position, wherein the path comprises a plurality of nodes represented by a plurality of Q-values, and performing, based on the identification of the policy, one of the following: on not identifying the policy, planning the path for the agent using a path planning-policy learning technique, which further comprises planning the path between the start position and the goal position based on a sampling path planning technique using the plurality of sampling parameters, and simultaneously learning a policy for the path planned based on a reinforcement learning technique and saving the policy in the learning base; or, on identifying the policy, path planning by the agent by determining a sequence of optimal actions using the policy, wherein an optimal action from the sequence of optimal actions is identified by iteratively evaluating the nodes of the policy, subject to a detection of obstacles, by performing, based on the detection of obstacles, one of the following: upon detecting an obstacle between a current state of the agent and a next state as per the path (policy), performing the planning and policy learning technique for path planning by the agent; or, upon identifying a removal of a previously detected obstacle between the current state of the agent and a next state as per the path (policy), performing a policy updating technique; or, upon not detecting an obstacle between the current state of the agent and a next state as per the path (policy), continuing to iteratively evaluate the nodes of the policy till reaching the goal position based on the policy.

[008] In yet another aspect, a non-transitory computer readable medium for simultaneous learning and planning is provided. The program includes receiving a plurality of inputs associated with an agent in an uncertain environment, wherein the plurality of inputs comprises a floor plan of the uncertain environment, a start position, a goal position, an agent dynamics, and a plurality of sampling parameters. Further, the program comprises identifying a policy for path planning by the agent from a policy learning base for the start position and the goal position, where the policy is a Q-tree comprising a path between the start position and the goal position, wherein the path comprises a plurality of nodes represented by a plurality of Q-values, and performing, based on the identification of the policy, one of the following: on not identifying the policy, planning the path for the agent using a path planning-policy learning technique, which further comprises planning the path between the start position and the goal position based on a sampling path planning technique using the plurality of sampling parameters, and simultaneously learning a policy for the path planned based on a reinforcement learning technique and saving the policy in the learning base; or, on identifying the policy, path planning by the agent by determining a sequence of optimal actions using the policy, wherein an optimal action from the sequence of optimal actions is identified by iteratively evaluating the nodes of the policy, subject to a detection of obstacles, by performing, based on the detection of obstacles, one of the following: upon detecting an obstacle between a current state of the agent and a next state as per the path (policy), performing the planning and policy learning technique for path planning by the agent; or, upon identifying a removal of a previously detected obstacle between the current state of the agent and a next state as per the path (policy), performing a policy updating technique; or, upon not detecting an obstacle between the current state of the agent and a next state as per the path (policy), continuing to iteratively evaluate the nodes of the policy till reaching the goal position based on the policy.
[009] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS
[010] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[011] FIG.1 illustrates an exemplary system for simultaneous learning and planning according to some embodiments of the present disclosure.
[012] FIG.2 is a functional block diagram of the system for simultaneous learning and planning according to some embodiments of the present disclosure.
[013] FIG.3A, FIG.3B and FIG.3C depict a flow diagram illustrating a method for simultaneous learning and planning in accordance with some embodiments of the present disclosure.
[014] FIG.4A illustrates a task to explore a path from a start position to a goal position and FIG.4B illustrates a path obtained using a sampling path planning technique in accordance with some embodiments of the present disclosure.
[015] FIG.5 illustrates an example of quadcopter for Q-learning in accordance with some embodiments of the present disclosure.
[016] FIG.6A illustrates a path with an obstacle and FIG.6B illustrates removal of obstacle from the path shown in FIG.6A in accordance with some embodiments of the present disclosure.
[017] FIG.7A, FIG.7B, FIG.7C and FIG.7D illustrate a policy learning in an uncertain environment in accordance with some embodiments of the present disclosure.
[018] FIG.8A, FIG.8B, FIG.8C and FIG.8D illustrate a policy learning in an uncertain environment in accordance with some embodiments of the present disclosure.
[019] FIG.9A and FIG.9B illustrate a policy learning in an uncertain environment with obstacles in accordance with some embodiments of the present disclosure.
[020] FIG.10A and FIG.10B illustrate an average execution time using the method in comparison with a state-of-the-art technique in accordance with some embodiments of the present disclosure.

[021] FIG.11A and FIG.11B illustrate an average execution time variation corresponding to the iterations for which obstacles are detected in accordance with some embodiments of the present disclosure.
[022] FIG.12A and FIG.12B illustrate adaptability to the uncertain environment conditions using the method in accordance with some embodiments of the present disclosure.
[023] FIG.13A and FIG.13B illustrate an example of adaptability to the uncertain environment conditions using state-of-the-art techniques for experimentation purposes.
[024] FIG.14A and FIG.14B illustrate percentage of frequency of exploration with that of obstacle detected using the method in accordance with some embodiments of the present disclosure.
[025] FIG.15A and FIG.15B illustrate adaptability to the uncertain environment conditions implemented using the Robot Operating System (ROS) in accordance with some embodiments of the present disclosure.
[026] FIG.16A illustrates obstacle configuration on a floor map of an uncertain environment, FIG.16B illustrates planning and adapting, FIG.16C illustrates planning after adaption and FIG.16D illustrates planning in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[027] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[028] Referring now to the drawings, and more particularly to FIG.1 through FIG.16D, where similar reference characters denote corresponding features

consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[029] FIG.1 is a functional block diagram of a system 100 for simultaneous learning and planning in accordance with some embodiments of the present disclosure.
[030] In an embodiment, the system 100 includes a processor(s) 104, communication interface device(s), alternatively referred as input/output (I/O) interface(s) 106, and one or more data storage devices or a memory 102 operatively coupled to the processor(s) 104. The system 100 with one or more hardware processors is configured to execute functions of one or more functional blocks of the system 100.
[031] Referring to the components of system 100, in an embodiment, the processor(s) 104 can be one or more hardware processors 104. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In an embodiment, the system 100 can be implemented in a variety of computing systems including laptop computers, notebooks, hand-held devices such as mobile phones, workstations, mainframe computers, servers, a network cloud and the like.
[032] The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, a touch user interface (TUI) and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface (s) 106 can include one or more ports for connecting a number of devices (nodes) of the system 100 to one another or to another server.

[033] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
[034] Further, the memory 102 may include a database 108 configured to include information regarding path planning that comprises a plurality of previous paths learnt. Thus, the memory 102 may comprise information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure. In an embodiment, the database 108 may be external (not shown) to the system 100 and coupled to the system via the I/O interface 106.
[035] Functions of the components of system 100 are explained in conjunction with functional overview of the system 100 in FIG.2 and flow diagram of FIGS.3A through 3C for simultaneous learning and path planning.
[036] The system 100 supports various connectivity options such as BLUETOOTH®, USB, ZigBee and other cellular services. The network environment enables connection of various components of the system 100 using any communication link including Internet, WAN, MAN, and so on. In an exemplary embodiment, the system 100 is implemented to operate as a stand-alone device. In another embodiment, the system 100 may be implemented to work as a loosely coupled device to a smart computing environment. The components and functionalities of the system 100 are described further in detail.
[037] FIG.2 is a functional block diagram of the various modules of the system of FIG.1, in accordance with some embodiments of the present disclosure. As depicted in the architecture, the FIG.2 illustrates the functions of the components of the system 100 that includes simultaneous learning and path planning.
[038] The system 200 for simultaneous learning and planning is configured for receiving a plurality of inputs associated with an agent in an uncertain environment via an input module 202. The system 200 comprises a policy identification module 204 configured for identifying a policy from a learning base 206 for path planning by the agent, wherein the learning base 206 is a dynamic database for learning (and saving) a plurality of paths planned by the agent. Further, based on the identification of the policy in the policy identification module 204, one of the following is performed: on not identifying the policy, the path is planned for the agent using a path planning-policy learning technique in the planning module 208 of the system 200; on identifying the policy, path planning is performed by the agent by determining a sequence of optimal actions using the policy in the optimal action determination module 210 of the system 200. The optimal action determination module 210, configured for determining the sequence of optimal actions, further comprises a policy updating module 212 and a policy-based planning module 214 for updating the policy if an obstacle is detected or a previously detected obstacle is removed. Further, the system 200 comprises an output module 216 configured to display the path planned by the agent based on simultaneous learning and path planning.
[039] The various modules of the system 100 for simultaneous learning and planning are implemented as at least one of a logically self-contained part of a software program, a self-contained hardware component, and/or a self-contained hardware component with a logically self-contained part of a software program embedded into each of the hardware components that, when executed, perform the method described herein.
[040] Functions of the components of the system 200 are explained in conjunction with functional modules of the system 100 stored in the memory 102 and further explained in conjunction with flow diagram of FIGS. 3A, 3B and 3C.
[041] FIG.3A, FIG.3B and FIG.3C, with reference to FIG.1, depict an exemplary flow diagram illustrating a method 300 for simultaneous learning and planning using the system 100 of FIG.1 according to an embodiment of the present disclosure. The steps of the method 300 of the present disclosure will now be explained with reference to the components of the system 100 for simultaneous learning and planning and the modules (202-216) as depicted in FIG.2 and the flow diagrams as depicted in FIG.3A through FIG.3C. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
[042] At step 302 of the method (300), a plurality of inputs associated with an agent in an uncertain environment is received at the input/output module 202. The plurality of inputs comprises a floor plan of the uncertain environment, a start position (s), a goal position (g), an agent dynamics (Ad), and a plurality of sampling parameters.
[043] The plurality of sampling parameters comprises a learning rate (α), a discounting factor (γ), a point closest to the goal position (εg) and a step size (∆t). Further, the agent dynamics (Ad) defines a relation between the current state of the agent and a next state of the agent obtained by executing an action at the current state of the agent, wherein the current state of the agent is the agent’s current state on the floor plan, represented as a position and an orientation. Further, each node from the plurality of nodes is represented by a Q-value with a state (c) and an action (a).
[044] In an embodiment, an uncertain environment includes dynamic obstacles as well as obstacles that move based on external application of force. The floor plan received as an input is a map of the uncertain environment, wherein the positions of obstacles are bound to change.
[045] At the next step 304 of the method (300), a policy for path planning is identified by the agent, via the policy identification module 204, from the learning base 206. The policy is identified as a similar matching policy for the start position (s) and the goal position (g), wherein s and g are received as inputs. The policy that is identified is a Q-tree, wherein the Q-tree of the policy comprises a path that can be taken by the agent between the start position and the goal position. The path comprises a plurality of nodes represented by a plurality of Q-values. Based on the identification of the policy, one of the steps 304A or 304B is performed, the details of which are explained in the sections below.
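By way of illustration only (the disclosure does not publish reference code), a minimal Python sketch of such a learning base keyed by the (start, goal) pair is given below; the class names PolicyLearningBase and QTreePolicy, the lookup tolerance and the data layout are assumptions of this sketch and are not part of the specification.

# Illustrative sketch (assumption): a policy learning base keyed by (start, goal),
# so that a policy learned once for a fixed start and goal can be reused later.
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Position = Tuple[float, float]   # (x, y) on the floor plan

@dataclass
class QTreePolicy:
    """Learned policy: Q-values indexed by (state, action) along the planned path."""
    start: Position
    goal: Position
    q_values: Dict[Tuple[Position, int], float] = field(default_factory=dict)

class PolicyLearningBase:
    """Dynamic database of learned policies; initially empty and filled over time."""
    def __init__(self, tolerance: float = 1e-6):
        self._policies: Dict[Tuple[Position, Position], QTreePolicy] = {}
        self._tol = tolerance

    def save(self, policy: QTreePolicy) -> None:
        self._policies[(policy.start, policy.goal)] = policy

    def lookup(self, start: Position, goal: Position) -> Optional[QTreePolicy]:
        """Return a previously learned policy matching this (start, goal) pair, if any."""
        for (s, g), policy in self._policies.items():
            if max(abs(s[0] - start[0]), abs(s[1] - start[1])) <= self._tol and \
               max(abs(g[0] - goal[0]), abs(g[1] - goal[1])) <= self._tol:
                return policy
        return None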

[046] Considering the scenario where the policy is not identified in the learning base 206, at step 304A the path is planned for the agent in the planning module 208. The path is planned for the agent using a path planning-policy learning technique. The path planning-policy learning technique comprises several steps that are explained in detail, as illustrated in FIG.3B, below:
[047] At step 304AA of method 300, the path is planned between the start position(s) and the goal position(g) based on a sampling path planning technique using the plurality of sampling parameters.
[048] In an embodiment, the sampling path planning technique includes a rapidly exploring random tree (RRT), an RRT based on a Voronoi bias principle. The RRT based on the Voronoi bias principle and the term “RRT*” are used interchangeably in the specification.
[049] An example scenario of RRT* path planning is illustrated in FIG.4A and FIG.4B, wherein FIG.4A shows the expansion of RRT* branches with an aim to explore a path from s to g. The path obtained is displayed in FIG.4B, wherein there are several intermediate nodes between s and g. Each node is associated with (c, a, Q(c, a)), wherein the starting node “s” is represented as (s, a1, Q(s, a1)).
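For illustration, a compact 2-D RRT sketch in Python is given below, showing the uniform sampling, nearest-node search and fixed step-size extension that give the tree its Voronoi bias. The rewiring refinement of RRT*, the agent dynamics Ad and the obstacle geometry are omitted; all names and defaults are assumptions of this sketch.

# Illustrative 2-D RRT sketch (assumption): uniform sampling, nearest-node search
# and a fixed step-size extension. The RRT* rewiring step is omitted for brevity.
import math
import random

def rrt(start, goal, step=2.0, goal_tol=5.0, bounds=(0.0, 100.0),
        is_free=lambda p, q: True, max_iters=5000, seed=0):
    """Grow a tree from `start` until a node lands within `goal_tol` of `goal`.
    Returns the path read back through the parent map, or None on failure."""
    random.seed(seed)
    nodes = [start]
    parent = {0: None}
    for _ in range(max_iters):
        sample = (random.uniform(*bounds), random.uniform(*bounds))
        # Nearest node: nodes with larger Voronoi regions are hit more often,
        # which is the Voronoi bias that drives exploration.
        i_near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], sample))
        near = nodes[i_near]
        d = math.dist(near, sample)
        if d == 0.0:
            continue
        new = (near[0] + step * (sample[0] - near[0]) / d,
               near[1] + step * (sample[1] - near[1]) / d)
        if not is_free(near, new):        # collision check against the floor plan
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i_near
        if math.dist(new, goal) <= goal_tol:
            path, i = [], len(nodes) - 1  # read the path back to the start
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return list(reversed(path))
    return None

# Example: the 100 m x 100 m arena and (25, 25) -> (75, 75) task from the experiments.
path = rrt((25.0, 25.0), (75.0, 75.0))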
[050] At step 304AB of method 300, a policy is learned simultaneously for the path planned based on a reinforcement learning technique in the learning base 206. The learnt policy is saved in the learning base 206.
[051] In an embodiment, the reinforcement learning technique is based on a combination of a Q-learning and a State-action-reward-state-action (SARSA) technique. The disclosed Q-learning-SARSA is implemented in several steps that include receiving the path generated by the sampling path planning technique, identifying the state (c) and the action (a), designing a reward (r) and finally generating a policy (Q-tree) by obtaining a Q-value.
[052] The path generated by the sampling path planning technique is received at the learning base 206, wherein the path comprises a plurality of nodes between the start position (s) and the goal position (g). Further, the state (c) and the action (a) are identified for each node from the plurality of nodes, wherein the state is considered as the agent’s configuration space and the action is equivalent to the action generated by the sampling path planning technique.
[053] Considering an example of a quadcopter, as shown in FIG.5, the state is c = (x, y, z, φ, θ, ψ) in the inertial frame, where φ, θ and ψ respectively are the roll, pitch and yaw of the quad (agent). The action is a = (vx, vy, vz, ω) in the body frame, where vx, vy and vz respectively are the velocities in the x, y, z directions of the quad, and ω is the yaw rate.
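As a minimal sketch of this representation (the exact symbols above are a reconstruction from the surrounding description, so the field names are assumptions), the state and action might be modelled as:

# Illustrative sketch (assumption) of the quadcopter state/action representation:
# state c in the inertial frame, action a in the body frame.
from dataclasses import dataclass

@dataclass
class QuadState:                 # c = (x, y, z, phi, theta, psi)
    x: float
    y: float
    z: float
    phi: float                   # roll
    theta: float                 # pitch
    psi: float                   # yaw

@dataclass
class QuadAction:                # a = (vx, vy, vz, omega)
    vx: float                    # velocity along body x
    vy: float                    # velocity along body y
    vz: float                    # velocity along body z
    omega: float                 # yaw rate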
[054] Further, based on the state (c) and the action (a) associated with each node, a reward (r) is designed for each node. Let Rmax ∈ R+ be the maximum immediate reward the agent can have after executing an action a from a state c. Then the immediate reward r(c, a) due to executing a at c is given below:

r(c, a) = Rmax, if ‖c′ − g‖ ≤ εg;  r(c, a) = 0, otherwise    ... (1)

where c′ is the state reached by executing a at c, g ∈ C is the goal state, εg is the goal tolerance and Rmax is the maximum immediate reward an agent can have because of a state-transition.
[055] Considering the scenario in FIG.4B, r(c, a) is maximum for the goal state-transition, which leads to the goal position g. In case of a non-goal state-transition, r(c, a) is set to 0. Any state-transition leading to an obstruction is penalized by assigning a negative immediate reward.
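A minimal sketch of this reward design is given below, following equation (1) with 2-D positions; the obstruction penalty is taken as -R_MAX, which is an assumption of the sketch (the disclosure only requires it to be negative), and the constants match the values used in the experiments.

# Sketch of the immediate reward r(c, a) of equation (1).
import math

R_MAX = 1.0      # maximum immediate reward (Rmax = 1 in the experiments)
EPS_G = 5.0      # goal tolerance (epsilon_g = 5 m in the experiments)

def reward(next_state, goal, hits_obstacle=False):
    """Immediate reward for the state-transition that ends in `next_state`."""
    if hits_obstacle:
        return -R_MAX                      # penalized transition (assumed magnitude)
    if math.dist(next_state, goal) <= EPS_G:
        return R_MAX                       # goal state-transition
    return 0.0                             # ordinary (non-goal) transition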
[056] Further, based on the state (c), the action (a) and the reward, a policy (Q-tree) is generated by obtaining a Q-value for each node.
[057] In an embodiment, RRT* generates control signals/actions to explore the uncertain environment quickly, as shown in FIG.4B. This control signal generation continues until the goal is explored, that is, until a node lies within εg of g. Once the goal is explored, a path is obtained using RRT* from s to g as shown in FIG.4B. This path and the rewards at each node are used to estimate the Q-values using the Q-value expression below:

Q(c, a) ← Q(c, a) + α [ r(c, a) + γ max_a′ Q(c′, a′) − Q(c, a) ]    ... (2)

where c′ is the next state reached from c by executing a.
[058] Further, the Q-values at each node are employed as the recorded past data for Backward Q-Learning (BQL) to update the Q-values at each node. Considering the scenario shown in FIG.4A and FIG.4B, BQL applies the Q-value update of equation (2) backwards along the path, starting from the node whose state-transition reaches the goal position g (the Q-value at the goal position itself is undefined) and proceeding towards the start position s. Hence, in general, the Q-value at each intermediate state between s and g can be expressed as shown below:

Q(c_n, a_n) = γ^(N−n) Rmax

where a_n denotes the action executed at the n-th node c_n along the path, and c_N is the node whose state-transition reaches the goal position g.
[059] Q-tree is updated based on the computed Q-values for the path as shown in FIG.4A and FIG.4B as given in Table.1 below:

Table 1: Updated Q-tree for path between s and g
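Assuming the reconstruction of equation (2) above with α = 1, the backward propagation of Q-values along the RRT* path might be sketched as follows; the function name and the data layout are illustrative assumptions.

# Sketch (assumption): Backward Q-Learning along the planned path with alpha = 1,
# i.e., Q(c_n, a) = r(c_n, a) + gamma * max_a' Q(c_{n+1}, a').
def learn_q_tree(path, rewards, gamma=0.9):
    """`path` is the node list from the sampler (start ... node nearest the goal) and
    `rewards[n]` is the immediate reward for the transition path[n] -> path[n+1]
    (Rmax for the goal state-transition, 0 otherwise).
    Returns Q-values keyed by (state, action index of the next node)."""
    q = {}
    downstream_best = 0.0                        # max_a' Q(c', a') of the node ahead
    for n in range(len(path) - 2, -1, -1):       # walk backwards from the goal side
        q[(path[n], n + 1)] = rewards[n] + gamma * downstream_best
        downstream_best = q[(path[n], n + 1)]
    return q

For the path of FIG.4B, with all intermediate rewards 0 and the goal state-transition rewarded with Rmax, this reproduces Q = γ^(N−n)·Rmax at the n-th node, consistent with the general expression above.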
[060] Referring back to FIG.3A, at the next step 304B of the method (300), considering the scenario where the policy is identified, the path planning by the agent is performed in the optimal action determination module 210. A sequence of optimal actions to be performed by the agent for path planning is determined using the policy. Sequentially, one optimal action from the sequence of optimal actions is identified by iteratively evaluating the nodes of the policy, subject to a detection of obstacles. For each optimal action, nodes are evaluated to detect obstacles. Based on the detection of obstacles, one of the actions illustrated in FIG.3C is performed: on detecting an obstacle, a new path is planned; on not detecting any obstacle, the same policy is used; and if a previously detected obstacle is removed, the policy is updated. Hence, based on the detection of obstacles, the action to be performed, as illustrated in FIG.3C, is explained in detail below:
[061] At step 304BA of method 300, upon detecting an obstacle, the planning and policy learning technique is performed for path planning in the planning module 208 and the learning base 206. The obstacle is detected between a current state of the agent and a next state (c´) as per the path (policy).
[062] In an embodiment, the detection of obstacles is performed based on an obstacle detection technique. The obstacle detection technique includes capturing a view of the environment using a sensor.
[063] In an embodiment, the path planning and policy learning technique has been explained with reference to FIG.3B. The path is planned based on the sampling path planning technique (RRT and RRT*) in the planning module 208. The path planned is received by the learning base 206 to learn the policy from the path. The policy is saved in the learning base 206 for further use. Once a policy is learnt for a fixed start and goal position, repeated path planning for the same start position and the same goal position can be avoided by identification of similar policies from the learning base 206.
[064] At step 304BB of method 300, upon identifying removal of a previously detected obstacle, a policy updating technique is performed in the policy updating module 212. The obstacle removal is detected between the current state of the agent and a next state (c´) as per the path (policy).
[065] In an embodiment, the policy updating technique is performed using several steps that include re-initializing the corresponding Q-value and further updating the policy at the current node based on the planning and policy learning technique.
[066] An example scenario of an obstacle in a path is illustrated for RRT* path planning in FIG.6A and FIG.6B, wherein FIG.6A shows several intermediate nodes between s and g along with the obstacle (between c2 and c3), wherein the path is designed so as to avoid the obstacle. In FIG.6B, the obstacle no longer exists between c2 and c3, hence the obstacle has been removed. In such a scenario, where a previously existing obstacle has been removed, the policy updating technique is implemented to update the policy.
[067] Based on the detected obstacle, the policy learned in the Q-tree is dynamically adapted, as listed in Table 1. In the example scenario of FIG.6A, an obstacle is detected between c2 and c3. The corresponding Q-value is dynamically updated using equation (2) with α = 1 as follows:

Q(c2, a3) = r(c2, a3) + γ max_a′ Q(c3, a′)

[068] Hence, if the obstacle is detected between the nth and the (n+1)th node, the above equation can be written as shown below:

Q(c_n, a_{n+1}) = r(c_n, a_{n+1}) + γ max_a′ Q(c_{n+1}, a′)

where r(c_n, a_{n+1}) is the negative immediate reward assigned to the obstructed state-transition from c_n towards c_{n+1}.
[069] Hence, if an obstacle is detected, then the corresponding Q-value becomes less than zero. Further, the policy is updated for two scenarios, namely “with obstacle” and “without obstacle/removal of previously detected obstacle”. Considering a scenario wherein the agent does not detect any obstacle, as illustrated in FIG.6B, for a policy that is being exploited, it is understood that the obstacle is removed. In this scenario, where the previously detected obstacle is removed, the policy is updated by re-initializing the corresponding Q-value to 0 and further updating the policy at the nodes of the previously detected obstacle (c2 and c3) based on the planning and policy learning technique as illustrated and explained using FIG.3B.
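The two adaptation cases described above (penalizing a blocked transition so that its Q-value becomes negative, and re-initializing it to 0 when the obstacle is found removed) might be realized as in the sketch below; the helper names, the storage layout and the penalty magnitude are assumptions of this sketch.

# Illustrative sketch (assumptions: Q-values stored as {(state, action): value};
# the obstruction penalty is taken as -1, i.e. -Rmax with Rmax = 1 and gamma = 0.9
# as in the experiments, which is enough to drive the Q-value below zero).
def penalize_blocked_transition(q, state, action, next_state, next_actions,
                                r_penalty=-1.0, gamma=0.9):
    """Obstacle detected between `state` and `next_state`: re-evaluate Q(state, action)
    with the negative obstruction reward (equation (2) with alpha = 1) so that the
    blocked action is no longer preferred by the greedy policy."""
    downstream = max((q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    q[(state, action)] = r_penalty + gamma * downstream

def clear_removed_obstacle(q, state, action):
    """Previously detected obstacle no longer present: re-initialize the corresponding
    Q-value to 0 before re-learning the policy at this node (step 304BB)."""
    q[(state, action)] = 0.0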
[070] At step 304BC of method 300, upon not detecting an obstacle, the nodes of the policy are iteratively evaluated till reaching the goal position (g) based on the policy in the policy based planning module 214. Hence, based on the policy learnt, the next nodes as per the policy are selected as optimal actions till detection of any obstacle. The obstacle detection is performed iteratively between the current state of the agent and a next state as per the path (policy).

[071] Once a policy is learnt for a fixed start and goal position, repeated path planning for the same start position and the same goal position can be avoided by exploiting the already learned path. However, once the agent has learned a policy to follow the path, the agent plans using the learned policy amidst environmental uncertainties. To select the optimal action for Q-value based planning, the agent selects an optimal action a* (based on the learned Q-values) at c as expressed below:

a* = argmax_a Q(c, a)
[072] Considering the example scenario of FIG.4B, each node has only one action to choose. However, in FIG.6B, at node c2, the agent has two actions to choose from; at c2 in FIG.6B the agent selects the action leading towards c3, whereas at c2 in FIG.6A the agent selects the alternative action that avoids the obstacle. So, the agent is dynamically selecting the optimal action based on the adapted Q-values because of the environmental uncertainties.
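Given the reconstructed selection rule a* = argmax_a Q(c, a), a possible greedy policy-execution helper is sketched below (the function name and Q-value layout are assumptions of the sketch).

# Sketch: greedy selection of the optimal action a* = argmax_a Q(c, a) at state c,
# using the adapted Q-values; an action whose Q-value has been driven below zero
# (blocked by a detected obstacle) is never preferred over an unblocked alternative.
def optimal_action(q, state, actions):
    """Return the action with the highest learned Q-value at `state`,
    or None if no action has been learned for this state."""
    learned = [(a, q[(state, a)]) for a in actions if (state, a) in q]
    if not learned:
        return None
    return max(learned, key=lambda pair: pair[1])[0]

In the scenarios above, at c2 in FIG.6B both stored actions are available and the one leading towards c3 (higher Q-value) is returned, while at c2 in FIG.6A the penalized action has a negative Q-value, so the alternative action is returned.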
[073] Hence with the disclosed technique of simultaneous learning and path planning, the agent can avoid repeated planning by exploiting the learned policy. Further considering that environments can be uncertain with many obstacles, the RL dynamically adapts the learned policy with the help of RRT or the agent can plan a new path as required. Thus, both learning and planning complement each other to handle environmental uncertainties dynamically in real-time and online.
[074] The identified paths based on the simultaneous learning and planning are displayed on an output module 216.
[075] EXPERIMENTS:
[076] Several experiments have been conducted to simulate the performance of the above disclosed simultaneous learning and planning (SLPA) method in comparison with the state-of-the-art technique (RRT* based on the Voronoi bias property). For experimentation purposes, simulations are conducted in MATLAB on a system with 8 GB of RAM, and on different systems on ROS. An uncertain environment/arena (say floor plan) of 100 m x 100 m is considered for all the simulations. An initial global path is created, considering an obstacle free arena, using RRT* from start position (25, 25) to goal position (75, 75) as shown in FIG.7A. The same path is then used by both SLPA and RRT* throughout the simulation. Further, to understand the results for the policy updating feature, uncertainties are introduced in the environment, wherein obstacles of dimension 6 m x 6 m are created randomly in the said environment. The obstacle configurations are kept identical in corresponding iterations for both SLPA and RRT*. The time step size for RRT* is ∆t = 2 s and the goal tolerance is εg = 5 m. The learning rate α and the discounting factor γ are respectively set to 1 (for fast learning) and 0.9. The value of Rmax is fixed at 1. In all the simulations, the agent is considered to be moving in a plane with quadcopter dynamics. Also, a few simulations are conducted on the Gazebo based Hector platform in the Robot Operating System (ROS), which provides modelling, control and simulation of a quadcopter, for experimentation purposes.
[077] Simulation results are explained in four parts. Simulation 1 deals with the evolution of the Q-tree (policy) over the iterations. Simulation 2 studies average execution time of the proposed SLPA over that of the RRT*. Simulation 3 shows the efficacy of the proposed SLPA in terms of adapting with the environmental uncertainties. Finally, in Simulation 4, SLPA is tested on Hector platform in Gazebo from Robot operating system (ROS) - for experimentation purposes.
[078] The first three simulations are conducted for 70 iterations, and the simulations are repeated five times to obtain average values. In Simulation 1, the adaption of the Q-tree (policy) over the iterations because of the uncertainties introduced in the environment is demonstrated. In Simulation 2, average execution time is the performance metric. Here, average execution time refers to the time taken by the agent to move from a given start to goal position in each iteration, calculated over 70 iterations. The said average execution time is analyzed in the following circumstances: (i) for the 70 iterations individually; (ii) mean, standard deviation, min and max of the average execution time over the said 70 iterations; (iii) the number of iterations out of 70 in which obstacles are encountered. In Simulation 3, the adaptability towards environmental uncertainties is considered as the performance metric. The adaptability towards environmental uncertainties refers to the situations where the agent does not need to explore a new path when an unknown obstacle is detected; rather, the agent can exploit the existing learned path (policy) to avoid the detected obstacle. The said performance metric is also analyzed in the following circumstances: (i) frequency of obstacle detection and frequency of exploring a new path; (ii) percentage of frequency of exploring a new path over that of obstacle detection. In Simulation 4, the Q-tree (policy) learned from Simulations 1-3 after 70 iterations is migrated to the Hector platform in ROS for planning with a different obstacle configuration. Here, the proposed SLPA is tested on the Hector platform by employing the quadcopter dynamics. The simulation is conducted five times to verify the behaviour under environmental uncertainties.
[079] Results for Simulations 1-4 are explained elaborately below.
[080] For Simulation 1: FIG.7A, FIG.7B, FIG.7C and FIG.7D show the evolution of the Q-tree over 70 iterations. The obstacles are indicated by squares, and the executed path in a particular iteration is shown by a continuous line in FIG.7A, FIG.7B, FIG.7C and FIG.7D. The corresponding Q-values of the paths in FIG.7A, FIG.7B, FIG.7C and FIG.7D are shown in FIG.8A, FIG.8B, FIG.8C and FIG.8D. FIG.7A shows the initial phase of Q-tree adaption, with the corresponding Q-values as shown in FIG.8A. It is apparent from FIG.7B and FIG.8B that in iteration 4 the Q-tree is adapted by avoiding the obstacles (environmental uncertainties). This Q-tree adaption continues over the iterations as shown in FIG.7C-FIG.8C and FIG.7D-FIG.8D. FIG.9A and FIG.9B show an interesting outcome of the proposed SLPA. In FIG.9A, the proposed SLPA is employed for three random obstacles, and the policy learned after 73 iterations is shown. A continuous line is the path in FIG.9A amidst the square obstacles. Now, the same learned policy (FIG.9A) is employed to plan in a different obstacle configuration having ten random obstacles as shown in FIG.9B. Similar to FIG.9A, in FIG.9B also a path (continuous line) is obtained in the first iteration amidst the square obstacles. It is apparent from FIG.9A and FIG.9B that the proposed SLPA can adapt to the environmental uncertainty efficiently.
[081] Further, for Simulation 2: the three circumstances of Simulation 2 are as follows: (i) It is apparent from FIG.10A and FIG.10B that the average execution time in SLPA (FIG.10A) is less in almost every iteration as compared to the same in RRT* (FIG.10B). (ii) The average minimum (min) and maximum (max) execution time required for SLPA and RRT* among all the 70 iterations are listed in columns 2 and 3 of Table 2 below:

Table 2 : Average execution time for SLPA and RRT*
[082] Columns 4 and 5 of Table 2 tabulate the average mean and average standard deviation of the required average execution time for SLPA and RRT* among all the 70 iterations. It is evident from Table 2 that SLPA is outperforming RRT* in terms of max and mean average execution time. (iii) FIG.11A and FIG.11B show the average execution time of those iterations where obstacles are detected. It is apparent from FIG.11A and FIG.11B that the average execution time in SLPA (FIG.11A) is much lower in almost every iteration as compared to the same in RRT* (FIG.11B), because SLPA exploits the learned path (policy), if available, after obstacle detection in the path.
[083] Further, for Simulation 3: the two circumstances of Simulation 3 are given below: (i) FIG.12B and FIG.13B show the frequency of obstacles detected during the execution. The frequency of exploration is shown in FIG.12A and FIG.13A. It is apparent that the frequency of exploration in the case of RRT* is more as compared to the same in SLPA, even though both algorithms are facing identical obstacle configurations. (ii) FIG.14A and FIG.14B provide additional support for SLPA in terms of adaptability towards environmental uncertainty. Here, the percentage of frequency of exploration is recorded over the frequency of obstacle detection. The cause of such superiority of SLPA over RRT* is that SLPA exploits the learned paths (policy); as a result, the frequency of exploration decreases for SLPA as the simulation progresses.
[084] For Simulation 4: similar to Simulation 3 (FIG.12A and FIG.13A), here, in FIG.15A and FIG.15B, the frequency of exploration is more in the case of RRT* and less in the case of SLPA with identical obstacle configurations. It is apparent from FIG.15A and FIG.15B that SLPA is exploiting the learned path (policy) to adapt to the environmental uncertainties and hence outperforms RRT*. FIG.16A, FIG.16B, FIG.16C and FIG.16D show the planning outcomes of the ROS simulation. The ROS simulations shown in FIG.16A to FIG.16D are conducted after executing SLPA for 70 iterations with different random obstacle configurations in MATLAB. Detailed descriptions of FIG.16A to FIG.16D are provided in the figure captions. It is apparent from this simulation that SLPA can adapt to the environmental uncertainties efficiently.
[085] The above simulation results demonstrate the superiority of the proposed SLPA over RRT*, mainly in terms of average execution time and handling of environmental uncertainties.
[086] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[087] The embodiments of the present disclosure provide an improvement over existing learning-planning techniques. The disclosure provides simultaneous learning and planning in a partially known/uncertain environment for an agent, without a separate requirement of offline training. Once a policy (sequence of steps to reach a goal) is learned for a fixed start and goal, repeated planning for the identical start and goal can be avoided. Hence, the disclosed technique learns a policy (sequence of actions) using Reinforcement Learning (RL) while simultaneously using RRT for planning. The disclosed framework exploits the Voronoi bias property of the Rapidly exploring Random Tree (RRT), which balances exploration-exploitation in RL. In scenarios of environmental uncertainties (obstacle detection or removal of a previously detected obstacle), the RL dynamically adapts the learned policy with the help of RRT. Hence, both learning and planning are performed simultaneously to complement each other to handle environmental uncertainties dynamically, in real-time and online.
[088] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
[089] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[090] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily

defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[091] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[092] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.

We Claim:
1. A processor-implemented method (300) comprising:
receiving a plurality of inputs associated with an agent in an uncertain environment, via one or more hardware processors, the plurality of inputs comprises a floor plan of the uncertain environment, a start position (s), a goal position (g), an agent dynamics (Ad), and a plurality of sampling parameters; (302) and
identifying (304) a policy for a path planning by the agent, via the one or more hardware processors, from a policy learning base for the start position (s) and the goal position (g), where the policy is a Q-tree comprising a path between the start position and the goal position, wherein the path comprises of a plurality of nodes represented by a plurality of Q-values and performing based on the identification of the policy one of below:
a) on not identifying the policy (304A), planning the path for the
agent using a path planning-policy learning technique,
comprising:
planning the path between the start position(s) and the goal position(g) based on a sampling path planning technique using the plurality of sampling parameters; (304AA) and
simultaneously learning a policy for the path planned based on a reinforcement learning technique and saving the policy in the learning base; (304AB) or
b) on identifying the policy (304B), path planning by the agent by
determining a sequence of optimal actions using the policy,
wherein an optimal action from the sequence of optimal actions
is identified by iteratively evaluating the nodes of the policy,
subject to a detection of obstacles, wherein based on the
detection of obstacles performing one of:
upon detecting an obstacle between a current state of the agent and a next state as per the path (policy), performing

the planning and policy learning technique for path planning by the agent; (304BA),or
upon identifying a removal of a previously detected obstacle between the current state of the agent and a next state as per the path (policy), performing a policy updating technique; (304BB),or
upon not detecting an obstacle between the current state of the agent and a next state as per the path (policy),
continuing to iteratively evaluate the nodes of the policy till the goal position (g) is reached based on the policy (304BC).
2. The method of claim 1, wherein the plurality of sampling parameters comprise a learning rate (α), a discounting factor (γ), a point closest to the goal position (εg) and a step size (∆t).
3. The method of claim 1, wherein the agent dynamics (Ad) defines a relation between the current state of the agent with a next state of the agent by executing an action at the current state of the agent, wherein the current state of the agent is the agent’s current state on the floor plan, represented as a position and an orientation.
4. The method of claim 1, wherein each node from the plurality of nodes is represented by a Q-value with a state (c) and an action (a).
5. The method of claim 1, wherein the policy learning base is a dynamic database for learning and saving a plurality of paths planned by the agent, wherein the policy learning base is initially empty, and is filled with a plurality of policies from the paths planned by the agent at later instances.
6. The method of claim 1, wherein the sampling path planning technique includes a rapidly exploring random tree (RRT), an RRT based on a

Voronoi bias principle.
7. The method of claim 1, wherein the reinforcement learning technique is based on a combination of a Q-learning and a State–action–reward–state– action (SARSA) technique.
8. The method of claim 1, wherein the detection of obstacles is performed based on an obstacle detection technique including a view of the environment using a sensor.
9. The method of claim 1, wherein the policy updating technique includes re-initializing the corresponding Q-value and updating the policy at the current node based on the planning and policy learning technique.
10. A system (100), comprising:
an input/output interface (106);
one or more memories (102); and
one or more hardware processors (104), the one or more memories (102) coupled to the one or more hardware processors (104), wherein the one or more hardware processors (104) are configured to execute programmed instructions stored in the one or more memories, to:
receive a plurality of inputs associated with an agent in an uncertain environment, via one or more hardware processors, the plurality of inputs comprises a floor plan of the uncertain environment, a start position (s), a goal position (g), an agent dynamics (Ad), and a plurality of sampling parameters; and
identify a policy for a path planning by the agent, via the one or more hardware processors, from a policy learning base for the start position (s) and the goal position (g), where the policy is a Q-tree comprising a path between the start position and the goal position, wherein the path comprises

of a plurality of nodes represented by a plurality of Q-values and performing based on the identification of the policy one of below:
a) on not identifying the policy, planning the path for the agent
using a path planning-policy learning technique, comprising:
planning the path between the start position(s) and the goal position(g) based on a sampling path planning technique using the plurality of sampling parameters; and
simultaneously learning a policy for the path planned based on a reinforcement learning technique and saving the policy in the learning base; or
b) on identifying the policy, path planning by the agent by
determining a sequence of optimal actions using the policy,
wherein an optimal action from the sequence of optimal actions
is identified by iteratively evaluating the nodes of the policy,
subject to a detection of obstacles, wherein based on the
detection of obstacles performing one of:
upon detecting an obstacle between a current state of the agent and a next state (c´) as per the path (policy), performing the planning and policy learning technique for path planning by the agent; or
upon identifying a removal of a previously detected obstacle between the current state of the agent and a next state (c´) as per the path (policy), performing a policy updating technique; or
upon not detecting an obstacle between the current state of the agent and a next state (c´) as per the path (policy), continuing to iteratively evaluate the nodes of the policy till the goal position (g) is reached based on the policy.
11. The system of claim 10, wherein the one or more hardware processors are configured by the instructions to perform detection of obstacles based on

an obstacle detection technique including a view of the environment using a sensor.
12. The system of claim 10, wherein the one or more hardware processors are configured by the instructions for learning and saving a plurality of paths planned by the agent in the policy learning base (dynamic database), wherein the policy learning base is initially empty, and is filled with a plurality of policies from the paths planned by the agent at later instances.
13. The system of claim 10, wherein the one or more hardware processors are configured by the instructions to perform the sampling path planning technique, wherein sampling path planning technique includes the rapidly exploring random tree (RRT), the RRT based on the Voronoi bias principle.
14. The system of claim 10, wherein the one or more hardware processors are configured by the instructions to perform the reinforcement learning technique is based on a combination of the Q-learning and the State– action–reward–state–action (SARSA) technique.

Documents

Application Documents

# Name Date
1 202121026458-STATEMENT OF UNDERTAKING (FORM 3) [14-06-2021(online)].pdf 2021-06-14
2 202121026458-REQUEST FOR EXAMINATION (FORM-18) [14-06-2021(online)].pdf 2021-06-14
3 202121026458-FORM 18 [14-06-2021(online)].pdf 2021-06-14
4 202121026458-FORM 1 [14-06-2021(online)].pdf 2021-06-14
5 202121026458-FIGURE OF ABSTRACT [14-06-2021(online)].jpg 2021-06-14
6 202121026458-DRAWINGS [14-06-2021(online)].pdf 2021-06-14
7 202121026458-DECLARATION OF INVENTORSHIP (FORM 5) [14-06-2021(online)].pdf 2021-06-14
8 202121026458-COMPLETE SPECIFICATION [14-06-2021(online)].pdf 2021-06-14
9 202121026458-Proof of Right [20-07-2021(online)].pdf 2021-07-20
10 202121026458-FORM-26 [22-10-2021(online)].pdf 2021-10-22
11 Abstract1..jpg 2021-11-30
12 202121026458-FER.pdf 2023-10-30
13 202121026458-OTHERS [06-03-2024(online)].pdf 2024-03-06
14 202121026458-FER_SER_REPLY [06-03-2024(online)].pdf 2024-03-06
15 202121026458-DRAWING [06-03-2024(online)].pdf 2024-03-06
16 202121026458-COMPLETE SPECIFICATION [06-03-2024(online)].pdf 2024-03-06
17 202121026458-CLAIMS [06-03-2024(online)].pdf 2024-03-06
18 202121026458-ORIGINAL UR 6(1A) FORM 1-140324.pdf 2024-03-16
19 202121026458-US(14)-HearingNotice-(HearingDate-03-07-2024).pdf 2024-06-10
20 202121026458-Correspondence to notify the Controller [27-06-2024(online)].pdf 2024-06-27
21 202121026458-Written submissions and relevant documents [15-07-2024(online)].pdf 2024-07-15
22 202121026458-PatentCertificate29-07-2024.pdf 2024-07-29
23 202121026458-IntimationOfGrant29-07-2024.pdf 2024-07-29

Search Strategy

1 search_strategy_0610E_06-10-2023.pdf

ERegister / Renewals

3rd: 13 Aug 2024

From 14/06/2023 - To 14/06/2024

4th: 13 Aug 2024

From 14/06/2024 - To 14/06/2025

5th: 06 May 2025

From 14/06/2025 - To 14/06/2026