Abstract: This disclosure relates generally to a method and system for imitation learning using non-expert human demonstrations. Training agents to learn from non-expert human demonstrations and modelling human skill levels for performing a task is time consuming and requires expert monitoring for every sequence of actions. The proposed disclosure provides a time-efficient and scalable model for training the agent while modelling the human skill levels. The method initially obtains the training data from human demonstrations via crowd sourcing. Further, a first set of consensus policy parameters for the training data corresponding to the task is initialized. The first set of consensus policy parameters is then estimated iteratively to obtain the consensus policy for training the agent.
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR IMITATION LEARNING USING NON-EXPERT HUMAN DEMONSTRATIONS
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
The present application claims priority from Indian provisional patent application no. 201821025013, filed on July 04, 2018. The entire contents of the aforementioned application are incorporated herein by reference.
TECHNICAL FIELD
The disclosure herein generally relates to estimation of a consensus policy, and, more particularly, to a method and system for imitation learning using non-expert human demonstrations.
BACKGROUND
Imitation learning (IL) is an alternate strategy for faster policy learning that works by leveraging additional information provided through expert demonstrations. IL implicitly gives an agent prior information by mimicking human behavior, where the agent is pre-trained under the control of an expert or a human user. However, most of the existing methods in IL focus on learning from expert demonstrations, while worker skill levels may vary in crowd settings. The learner agent tries to imitate the actions performed in the expert demonstrations in order to learn the policy. Practically, hiring such experts for demonstrations may be infeasible or expensive.
Most of the existing approaches for imitation learning focus on learning from expert demonstrations. Such expert demonstrations teach the agent in a supervised way by querying the expert action for the states observed in the previous iteration. Therefore, the expert teaches the agent how to correct mistakes performed in the earlier iteration. Also, policies learned with standard IL can be inferior to tackling the Reinforcement Learning (RL) problem directly with approaches such as policy gradients. Moreover, such approaches are limited in training RL agents to learn from non-expert human demonstrations by modeling human skills into the RL agent based on estimated consensus actions at various states and by updating the worker skills iteratively while training the RL agent.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a system for imitation learning using non-expert human demonstrations is provided. The system includes a processor, an Input/Output (I/O) interface and a memory coupled to the processor, wherein the processor is capable of executing programmed instructions stored in the memory to pre-process a plurality of tasks to obtain training data. The preprocessing step comprises obtaining the data from human demonstrations corresponding to the plurality of tasks via crowd sourcing. Further, a first set of consensus policy parameters is initialized for the training data corresponding to the task. Here, the first set of consensus policy parameters includes a worker annotation, a policy parameter, a worker parameter and a difficulty parameter. Further, the first set of consensus policy parameters is estimated using an iterative process to estimate the consensus policy for training the agent, wherein the iterative process comprises the following sub steps. The first set of consensus policy parameters is estimated based on the previous values and the probability of the set of possible actions over the observed states, the policy parameter, the worker annotation, the worker parameter and the difficulty parameter. In the first iteration, the previous values are the initialized values of the first set of consensus policy parameters. Further, a consensus policy is computed for training the agent in accordance with the first set of consensus policy parameters. The agent is then trained using a combination of reward signals from the environment, by calculating the loss from a reinforcement learning algorithm, and the consensus policy, by obtaining the distillation loss (LD) from the trained agent. The first set of consensus policy parameters is then re-estimated to obtain a second set of consensus policy parameters, wherein the second set of consensus policy parameters comprises a new estimated value for the policy parameter, the worker parameter and the difficulty parameter, based on the distillation loss (LD) and the weighted state-action probabilities of the worker parameter and the difficulty parameter.
In another aspect, a processor-implemented method is provided. The method includes pre-processing, by the processor, a plurality of tasks to obtain training data. The preprocessing step comprises obtaining the data from human demonstrations corresponding to the plurality of tasks via crowd sourcing. Further, a first set of consensus policy parameters is initialized for the training data corresponding to the task. Here, the first set of consensus policy parameters includes a worker annotation, a policy parameter, a worker parameter and a difficulty parameter. Further, the first set of consensus policy parameters is estimated using an iterative process to estimate the consensus policy for training the agent, wherein the iterative process comprises the following sub steps. The first set of consensus policy parameters is estimated based on the previous values and the probability of the set of possible actions over the observed states, the policy parameter, the worker annotation, the worker parameter and the difficulty parameter. In the first iteration, the previous values are the initialized values of the first set of consensus policy parameters. Further, a consensus policy is computed for training the agent in accordance with the first set of consensus policy parameters. The agent is then trained using a combination of reward signals from the environment, by calculating the loss from a reinforcement learning algorithm, and the consensus policy, by obtaining the distillation loss (LD) from the trained agent. The first set of consensus policy parameters is then re-estimated to obtain a second set of consensus policy parameters, wherein the second set of consensus policy parameters comprises a new estimated value for the policy parameter, the worker parameter and the difficulty parameter, based on the distillation loss (LD) and the weighted state-action probabilities of the worker parameter and the difficulty parameter.
In yet another aspect, one or more non-transitory machine-readable information storage mediums are provided, comprising one or more instructions which, when executed by one or more hardware processors, cause pre-processing of a plurality of tasks to obtain training data. The preprocessing step comprises obtaining the data from human demonstrations corresponding to the plurality of tasks via crowd sourcing. Further, a first set of consensus policy parameters is initialized for the training data corresponding to the task. Here, the first set of consensus policy parameters includes a worker annotation, a policy parameter, a worker parameter and a difficulty parameter. Further, the first set of consensus policy parameters is estimated using an iterative process to estimate the consensus policy for training the agent, wherein the iterative process comprises the following sub steps. The first set of consensus policy parameters is estimated based on the previous values and the probability of the set of possible actions over the observed states, the policy parameter, the worker annotation, the worker parameter and the difficulty parameter. In the first iteration, the previous values are the initialized values of the first set of consensus policy parameters. Further, a consensus policy is computed for training the agent in accordance with the first set of consensus policy parameters. The agent is then trained using a combination of reward signals from the environment, by calculating the loss from a reinforcement learning algorithm, and the consensus policy, by obtaining the distillation loss (LD) from the trained agent. The first set of consensus policy parameters is then re-estimated to obtain a second set of consensus policy parameters, wherein the second set of consensus policy parameters comprises a new estimated value for the policy parameter, the worker parameter and the difficulty parameter, based on the distillation loss (LD) and the weighted state-action probabilities of the worker parameter and the difficulty parameter.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates an exemplary block diagram of a system for imitation learning using non-expert human demonstrations, in accordance with some embodiments of the present disclosure.
FIG. 2 illustrates a flow diagram for imitation learning using non-expert human demonstrations, in conjunction with FIG. 3, in accordance with some embodiments of the present disclosure.
FIG. 3 is a high-level architecture for imitation learning using non-expert human demonstrations, in accordance with some embodiments of the present disclosure.
FIG. 4 illustrates experimental results representing a sequence of scores of various training configurations after 500 iterations (200 epochs) on Seaquest, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
The embodiments herein provide a method and system for estimating a consensus policy for training an agent to imitate human behavior. The proposed method and system provide a mechanism for training the agent to learn from non-expert human demonstrations. The system herein may be interchangeably referred to as the consensus policy estimation system. The agent may be a reinforcement learning agent. The method enables modelling the human skill levels into the agent, which are learned by the agent through consensus actions at various states in an environment. The method collects the training data for training the agent from users who may be non-experts in a technology area. The method also iteratively updates the worker skill levels, using the learned weights for demonstrations, over the entire training period. The embodiments herein formulate the continuous use of non-expert demonstration data for the agent by means of a distillation loss. The proposed disclosure provides an iterative algorithm to learn the consensus policy across demonstrations and uses weighted demonstrations by modeling each worker's skill level to train the agent. Also, the loss function and regularization methods provide efficient scalability during the performance of the task.
Referring now to the drawings, and more particularly to FIG. 1 through FIG.4, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates an exemplary block diagram of a system for imitation learning using non-expert human demonstrations, in accordance with some embodiments of the present disclosure. In an embodiment, the consensus policy estimation system 100 includes processor (s) 104, communication interface device(s), alternatively referred as or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the processor (s) 104. The processor (s) may be alternatively referred as one or more hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) 104 is configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server for verifying software code.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 102 may include a repository 108. The repository 108 may store the training data obtained from the crowd sourcing. The memory 102 may further comprise information pertaining to input(s)/output(s) of each step performed by the system 100 and methods of the present disclosure. The system 100 can be configured to pre-process a plurality of tasks to obtain the training data. The training data are obtained from human demonstrations corresponding to the plurality of tasks via crowd sourcing to train the agent, thereby enabling the agent to generate one or more recommendations in response to a scenario given as input to the agent after the training. The input data that is used to train the agent is inconsistent because the input data is fetched/collected/obtained from non-experts in the particular technology area in which the agent is to be deployed. As the input data is collected from non-experts, the data may be inconsistent in terms of accuracy (technical correctness), completeness, terminologies used and so on. The system 100 processes the 'inconsistent data', removes all discrepancies, and prepares the data to be used for training the agent, which can then perform the task in a network environment.
FIG. 2 illustrates a flow diagram for imitation learning using non-expert human demonstrations, in conjunction with FIG. 3, in accordance with some embodiments of the present disclosure. The steps of the method 200 of the flow diagram will now be explained with reference to the components or blocks of the system 100 and in conjunction with the example architecture of the system depicted in FIG. 3, which is a high-level architecture for imitation learning using non-expert human demonstrations, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the one or more processors 104 and is configured to store instructions for execution of the steps of the method 200 by the one or more processors 104. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
At step 202 of the method 200, the processor 104 is configured to pre-process the plurality of tasks to obtain the training data. Preprocessing of the plurality of tasks comprises obtaining the data from human demonstrations corresponding to the plurality of tasks via crowd sourcing. Initially, the training data used for training the agent (i.e. the inconsistent data) is collected through one or more suitable interfaces. For example, a web-based platform is provided which allows users ('non-experts' in this context) to provide inputs to the system 100. The user inputs, which include demonstrations along with the observations and rewards returned by a task environment, are stored in a database and are further used for training the agent.
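As an illustration only, the following is a minimal Python sketch of how such crowd-sourced demonstration records might be organized; the names (DemonstrationRecord, add_step) are hypothetical and the disclosure does not prescribe any particular schema.

# Minimal sketch (hypothetical schema): recording a non-expert worker's
# demonstration episode as (state, action, reward) tuples for later training.
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class DemonstrationRecord:
    worker_id: str                      # identifier of the crowd worker
    states: List[Any] = field(default_factory=list)
    actions: List[int] = field(default_factory=list)
    rewards: List[float] = field(default_factory=list)

    def add_step(self, state: Any, action: int, reward: float) -> None:
        """Append one transition returned by the task environment."""
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)

# Usage: each step of the worker's episode is appended, and the record is
# persisted to the demonstration database once the episode ends.
record = DemonstrationRecord(worker_id="worker_7")
record.add_step(state=[0.0, 1.0], action=2, reward=0.5)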
At step 204 of the method 200, the processor 104 is configured to initialize a first set of consensus policy parameters for the training data corresponding to the task, wherein the first set of consensus policy parameters includes a worker annotation, a policy parameter, a worker parameter and a difficulty parameter. The first set of consensus policy parameters are initial values obtained using the web-based application and are used for training the agent to imitate the demonstrators. The first set of consensus policy parameters includes:
θ : Network (policy) parameters
d_i : State difficulty parameter for state i
w_j : Worker parameter for worker j
z_ij : Action taken by worker j in state i
s_i : State i
a_k : Action k
At step 206 of the method 200, the processor 104 is configured to estimate the first set of consensus policy parameters using the iterative process to estimate the consensus policy for training the agent. The first set of consensus policy parameters are estimated based on the previous values and the probability of the set of possible actions over the observed states, the policy parameter, the worker annotation, the worker parameter and the difficulty parameter. Here, for the first iteration, the previous values are the initialized values of the first set of consensus policy parameters. Further, the agent is trained with a consensus policy computed in accordance with the first set of consensus policy parameters. The consensus policy computation is performed upon initializing the first set of consensus policy parameters; the worker and state parameters are then updated using the existing network's consensus policy. The worker parameter w_j encodes the skill level of worker j, and the state parameters record the states visited by worker j, who performs action z_ij at each state. An existing network is used to update the consensus policy, and the updating of the parameters and the consensus policy is repeated until convergence. Further, the agent is trained using the consensus policy and the distillation loss (LD) is obtained from the trained agent. The distillation loss (LD) is computed using the consensus policy obtained from the agent and the policy parameter of the first set of consensus policy parameters.
In an embodiment, the first set of consensus policy parameters are re-estimated to obtain a second set of consensus policy parameters. Here, the second set of consensus policy parameters comprises a new estimated value for the policy parameter, the worker parameter and the difficulty parameter, based on the distillation loss (LD) and the weighted state-action probabilities of the worker parameter and the difficulty parameter. Consider an example where the consensus policy is taken as input to perform reward maximization by running k epochs and calculating the proximal policy optimization (PPO) loss. PPO is a policy optimization algorithm which updates the policy parameters via policy gradient steps that optimize the PPO loss. The PPO loss is calculated based on the rewards obtained from the k epochs and promotes the actions that lead to higher rewards. Further, the distillation loss (LD) is computed using the updated consensus policy and the environment is updated. Distillation is a process used for neural model compression. The distillation loss (LD) is a penalty calculated based on the difference between the action probabilities of the models. The distillation loss (LD) prompts the agent to produce policies that take actions similar to the consensus policy. By modelling the non-expert workers and the environment states occurring in their non-expert demonstrations, and by using the current consensus policy as a prior, a consensus policy that represents a consensus of the actions of the non-expert workers is generated. By adding the distillation penalty, the proposed model outputs policies similar to the consensus policy. The proposed method then computes a new consensus policy based on the previous values of the first set of consensus policy parameters for every iteration, as performed in step 206, to train the agent. The sub steps of step 206 include estimating the first set of consensus policy parameters based on the previous values and the probability of the set of possible actions over the observed states, the policy parameter, the worker annotation, the worker parameter and the difficulty parameter, where for the first iteration the previous values are the initialized values of the first set of consensus policy parameters. Then, the consensus policy is computed for training the agent in accordance with the second set of consensus policy parameters. Further, the agent is trained using the consensus policy and the distillation loss (LD) is obtained from the trained agent. Finally, the first set of consensus policy parameters are re-estimated to obtain the second set of consensus policy parameters, wherein the second set of consensus policy parameters comprises a new estimated value for the policy parameter, the worker parameter and the difficulty parameter, based on the distillation loss (LD) and the weighted state-action probabilities of the worker parameter and the difficulty parameter.
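A minimal Python sketch of this iterative loop is given here for orientation only; the helper routines (estimate_consensus, train_agent_one_round), the decay factor and the iteration count are hypothetical placeholders, while the actual parameter updates follow equations (7) to (11) elaborated later.

# Sketch of the iterative procedure of step 206, under assumed placeholder helpers.
def estimate_consensus(params, demonstrations):
    """Placeholder: re-estimate worker skills, state difficulties and consensus actions."""
    return params

def train_agent_one_round(agent, params, alpha):
    """Placeholder: run PPO epochs and add the alpha-weighted distillation loss L_D."""
    return agent, 0.0  # updated agent and the distillation loss value

demonstrations = []                                         # crowd-sourced (state, action) records
params = {"theta": None, "workers": {}, "difficulty": {}}   # first set (initialized values)
agent, alpha = object(), 1.0                                # alpha scales L_D and decays over time

for iteration in range(50):
    params = estimate_consensus(params, demonstrations)     # consensus policy pi_cns
    agent, distillation_loss = train_agent_one_round(agent, params, alpha)
    alpha *= 0.95                                           # rely less on the crowd policy over time
    # the re-estimated values form the second set of consensus policy parameters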
The sub steps above are elaborated below in conjunction with FIG. 3:
Reinforcement Learning Framework
The typical framing of an RL scenario is that an agent takes actions in an environment, which are then interpreted into a reward and a representation of the state and are fed back into the agent. The Reinforcement Learning problem considered in some embodiments of the present disclosure is defined by a Markov Decision Process (MDP). An MDP is characterized by a tuple ⟨S, A, R, T, γ⟩, where S is the set of states, A is the set of actions, R(s, a) is the reward function, T(s, a, s') = P(s'|s, a) is the transition probability, and γ is the discount factor. An agent in a particular state interacts with the environment by taking an action, and receives the reward while transitioning to the next state.
A goal of the agent is to learn a policy π such that the agent maximizes the future discounted reward, as described in equation 1:

\pi^{*} = \arg\max_{\pi} \sum_{t} \gamma^{t}\, \mathbb{E}_{(s_t, a_t) \sim \pi}\left[R_t\right] \qquad (1)
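To make the discounted-return objective of equation (1) concrete, the following minimal Python sketch computes the discounted sum of rewards for a single trajectory; the reward values and discount factor are illustrative only.

# Discounted return for one trajectory: sum_t gamma^t * r_t.
def discounted_return(rewards, gamma=0.99):
    """Accumulate gamma-discounted rewards over one episode."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.99**2 * 2.0 = 2.9602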
Proximal Policy Optimization
In policy gradient methods, the policy gradient is estimated and used in a stochastic gradient ascent algorithm. Proximal Policy Optimization (PPO) is a variant of policy gradient methods in which the size of each policy update is constrained while a clipped objective is maximized.
L_{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]

\arg\max_{\theta} L_{CLIP}(\theta) \qquad (2)

subject to

L_{KL} = KL\left(\pi_{\theta_{old}}(\cdot \mid s_t),\ \pi_{\theta}(\cdot \mid s_t)\right) \le \delta \qquad (3)

where KL is the Kullback-Leibler divergence, r_t(θ) is the probability ratio between the new and the old policy, and \hat{A}_t is the advantage estimate. The constraint is applied by using a penalty as follows:

\arg\max_{\theta} L_{PPO} = L_{CLIP} - \beta L_{KL} \qquad (4)
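For illustration, the following is a minimal PyTorch sketch of the penalized PPO objective of equations (2) to (4), i.e., the clipped surrogate minus a beta-weighted KL term; it assumes the caller supplies log-probabilities under the old and new policies, advantage estimates and a KL value, all of which are placeholder inputs here.

# Clipped PPO surrogate with a KL penalty, maximized by gradient ascent.
import torch

def ppo_objective(logp_new, logp_old, advantages, kl_div, eps=0.2, beta=0.01):
    ratio = torch.exp(logp_new - logp_old)                 # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    l_clip = torch.min(ratio * advantages, clipped * advantages).mean()
    return l_clip - beta * kl_div                          # L_PPO = L_CLIP - beta * L_KL

# Example: ascend on the objective (equivalently, descend on its negative).
logp_new = torch.tensor([-0.9, -1.1], requires_grad=True)
obj = ppo_objective(logp_new, torch.tensor([-1.0, -1.0]),
                    torch.tensor([0.5, -0.2]), kl_div=torch.tensor(0.01))
(-obj).backward()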
Let S = {s_i} be the set of states observed. At each state s_i, a worker (a user) who sees the state (wherein the state may be displayed to the user) takes an action z_ij. The worker is asked to complete an episode, and every state-action pair of the worker is recorded. Let π_j be the policy of worker j. If there were multiple annotations for each state, it would be easy to set up a standard consensus algorithm to estimate the consensus policy π_cns to be used for guided exploration; however, it is infeasible that every state has even a single annotation. Hence π_j is extrapolated to the states not seen by the worker to arrive at a consensus. The embodiments of the present disclosure make use of deep neural networks for this generalization of the policy to unseen states, by using a convolutional neural network with three convolutional layers.
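A minimal PyTorch sketch of such a three-convolutional-layer policy network is given below for illustration; it assumes Atari-style stacked 84x84 frames, and the specific layer sizes are assumptions rather than values prescribed by the disclosure.

# Policy network with three convolutional layers mapping observations to action probabilities.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(64 * 7 * 7, 512),
                                  nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, x):
        return torch.softmax(self.head(self.conv(x)), dim=-1)  # action probabilities

probs = PolicyNet(n_actions=18)(torch.zeros(1, 4, 84, 84))      # one dummy observation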
Parameterized Policy and Distillation
As the first step, the parameterized policy of the agent π_θ is considered, where θ are the parameters, such as the weights and biases of the network. The embodiments of the present disclosure make use of the confidence values of each action, produced by the network, for better estimates of the skill and difficulty parameters. Hence, the policy is learnt in conjunction with the other parameters. Eq. (5) uses the consensus policy to guide the exploration of the parameterized policy by adding a regularization loss to match the soft actions of π_θ and π_cns.
L_{D}(\theta) = \hat{\mathbb{E}}_{s}\left[\,\left\lVert \pi_{\theta}(\cdot \mid s) - \pi_{cns}(\cdot \mid s)\right\rVert\,\right] \qquad (5)
The embodiments of the present disclosure scale the distillation loss by α and reduce α over time, since the optimal policy need not match the crowd policy, and estimate the distillation loss using a number of random samples from the observed states.
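The following PyTorch sketch illustrates one way to estimate the α-scaled distillation loss of equation (5) on a random sample of observed states; the state features, the small linear policy head and the choice of an L1 distance between the action distributions are assumptions made for illustration.

# Distillation loss between the agent policy and the consensus policy on sampled states.
import torch

def distillation_loss(agent_probs, consensus_probs):
    """Mean L1 distance between the agent's and the consensus action distributions."""
    return (agent_probs - consensus_probs).abs().sum(dim=-1).mean()

observed_states = torch.randn(1000, 8)               # placeholder state features
batch = observed_states[torch.randperm(1000)[:32]]   # random sample of observed states
net = torch.nn.Linear(8, 6)                          # placeholder parameterized policy head
agent_p = torch.softmax(net(batch), dim=-1)          # pi_theta(.|s) on the sampled states
cns_p = torch.softmax(torch.randn(32, 6), dim=-1)    # placeholder consensus policy pi_cns(.|s)
alpha = 0.5                                          # decayed across training iterations
loss = alpha * distillation_loss(agent_p, cns_p)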
Worker Skill and Difficulty
Let w_j be the parameter encoding the skill level of worker j. The skill level represents the confidence in the worker's actions; for example, an expert should have a higher skill level than a non-expert. Skill levels are estimated for each worker. The action probabilities of a state are weighted according to the skill levels of the workers annotating the state and the inherent difficulty of the state i, encoded by d_i. The worker's skill is modelled as 0 ≤ w_j ≤ 1, where a highly skilled worker has w_j near 1. A prior of mixed Beta distributions is considered to model different types of workers (high skill, low skill, spammers), and 0 ≤ d_i ≤ 1 denotes the difficulty level of state i.
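As an illustration of such a mixed Beta prior over worker skill, the following Python sketch uses SciPy; the mixture weights and Beta shape parameters are assumptions chosen only to show the three worker types (high skill, low skill, spammer).

# Mixture-of-Beta prior over worker skill w_j in [0, 1].
import numpy as np
from scipy import stats

def skill_prior_pdf(w):
    """Illustrative mixture of Beta densities for different worker types."""
    return (0.5 * stats.beta.pdf(w, 8, 2)      # high-skill workers, mass near 1
            + 0.3 * stats.beta.pdf(w, 2, 8)    # low-skill workers, mass near 0
            + 0.2 * stats.beta.pdf(w, 1, 1))   # spammers, roughly uniform

w = np.clip(np.array([0.9, 0.4, 0.1]), 0.0, 1.0)   # example skill estimates
print(skill_prior_pdf(w))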
Parameter Estimation
Let A = {a_k} be the set of all possible actions. The joint probability distribution over the observed states S = {s_i}, worker annotations Z = {z_ij}, policy parameters θ, worker parameters W = {w_j} and difficulty parameters D = {d_i} is defined as

P(Z, W, D, \theta \mid S) = P(\theta)\,\prod_{i}\Big(P(d_i)\prod_{k} P(a_k \mid s_i, \theta)\Big)\prod_{j} P(w_j)\prod_{i,j} P(z_{ij} \mid s_i, d_i, w_j) \qquad (6)
The next step is to estimate the parameters by an alternating maximization algorithm to obtain the consensus policy, as represented below in equations 7 and 8:

\hat{\pi}_{cns}(a_k \mid s_i) \propto P(a_k \mid s_i, \hat{\theta})\,\prod_{j} P(z_{ij} \mid a_k, \hat{d}_i, \hat{w}_j) \qquad (7)

\hat{a}_i = \arg\max_{a_k} \hat{\pi}_{cns}(a_k \mid s_i) \qquad (8)
The new estimated value for the difficulty parameter of the second set of consensus policy parameters is obtained using the probabilities of the maximized arguments of the consensus policy, the worker annotations, the difficulty parameter value of the first set of consensus policy parameters and the new estimated value of the worker parameter, as represented below in equation 9:

\hat{d}_i = \arg\max_{d_i} P(d_i)\,\prod_{j} P(z_{ij} \mid \hat{a}_i, d_i, \hat{w}_j) \qquad (9)
Further, the new estimated value for the worker parameter of the second set of consensus policy parameters is obtained using the probabilities of the maximized arguments of the consensus policy, the worker annotations, the worker parameter value of the first set of consensus policy parameters and the new estimated value of the difficulty parameter, as represented below in equation 10:

\hat{w}_j = \arg\max_{w_j} P(w_j)\,\prod_{i} P(z_{ij} \mid \hat{a}_i, \hat{d}_i, w_j) \qquad (10)
Further, the new estimated value for the policy parameter of the second set of consensus policy parameters is computed using the maximized arguments of the loss obtained from the proximal policy optimization and the distillation loss, with respect to the policy parameter of the first set of consensus policy parameters, as represented below in equation 11:

\hat{\theta} = \arg\max_{\theta}\left(L_{PPO}(\theta) - \alpha L_{D}(\theta)\right) \qquad (11)
where P(a_k | s_i, θ̂) is the confidence output from the agent, P(d_i) and P(w_j) are priors, and P(z_ij | a, d̂_i, ŵ_j) is the probability of worker j taking action z_ij given that a is the optimal action. Equation (11) is solved by stochastic gradient ascent.

P(z_{ij} \mid a, \hat{d}_i, \hat{w}_j) =
\begin{cases}
\hat{w}_j\,(1 - \hat{d}_i) & \text{if } z_{ij} = a \\
\dfrac{1 - \hat{w}_j\,(1 - \hat{d}_i)}{|A| - 1} & \text{otherwise}
\end{cases} \qquad (12)
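The following NumPy sketch illustrates a simplified reading of equations (7), (8) and (12): each worker's vote is weighted by the worker's skill and the state difficulty, the network policy acts as the prior over actions, and the consensus action is the argmax of the resulting distribution. The function and variable names, and the normalization step, are illustrative assumptions rather than the exact implementation.

# Weighted consensus for one state from worker annotations.
import numpy as np

def annotation_likelihood(z, a, w_j, d_i, n_actions):
    """P(z_ij | a, d_i, w_j) as in Eq. (12)."""
    p_correct = w_j * (1.0 - d_i)
    return p_correct if z == a else (1.0 - p_correct) / (n_actions - 1)

def consensus_action(policy_probs, annotations, skills, difficulty):
    """Eqs. (7)-(8): combine the network policy with worker votes for one state."""
    n_actions = len(policy_probs)
    scores = np.array(policy_probs, dtype=float)
    for a in range(n_actions):
        for worker, z in annotations.items():
            scores[a] *= annotation_likelihood(z, a, skills[worker], difficulty, n_actions)
    scores /= scores.sum()                       # normalized consensus pi_cns(.|s_i)
    return scores, int(np.argmax(scores))        # consensus distribution and hat{a}_i

probs, a_hat = consensus_action([0.25, 0.25, 0.25, 0.25],
                                annotations={"w1": 2, "w2": 2},
                                skills={"w1": 0.9, "w2": 0.6}, difficulty=0.2)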
FIG. 4 illustrates experimental results representing a sequence of scores of various training configurations after 500 iterations (200 epochs) on Seaquest, in accordance with some embodiments of the present disclosure. Experiments were performed considering three Atari games: one where humans were better than agents, one where the agent was significantly better, and one where both performed similarly. Scores were obtained for human performance and for agent performance. Based on the ratio of human to agent score, Seaquest (ratio = 11.44), which is available on OpenAI Gym, was chosen. Seaquest has a high ratio, indicating that humans can perform better than machines on this game; hence, there is a possibility of incorporating techniques used by humans to better train the agent. As a preliminary step, a plurality of players was considered and their gameplay data, including actions performed, states and rewards generated by the environment, was stored. Only the controls of the game were explained to the players, without elaborating on the specific game mechanics. This was done to ensure that the players explored the game's reward mechanisms and would improve during the course of their episodes. After collecting the demonstration data, the agent was trained in four configurations. In the first configuration, the distillation loss was not incorporated during training, to simulate vanilla RL training without demonstration data. In the second configuration, equal skill levels were set for all workers and were not updated during training; this served as a baseline to understand how the algorithm performed if the worker skill level was not modelled and all demonstrations were treated as oracle demonstrations. Finally, two configurations were run in which the worker skill levels were updated at a low frequency (an update every 10 iterations) and at a high frequency (an update every iteration). As shown in FIG. 4, updating the worker parameters at a low frequency gave the fastest training improvement and the highest average score after 50 iterations, each iteration being 4 epochs. Treating all workers as experts (equal high skill) may hamper the effectiveness of training the system 100, which in turn necessitates worker modelling for high performance levels.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
The embodiments of the present disclosure address the unresolved problem of training an agent to learn from non-expert human demonstrations. The proposed disclosure enables modeling the worker skill levels, and using weighted demonstrations during training helps in improving the training speed significantly. In every iteration, the consensus policy plays the role of the expert, teaching the agent to correct the mistakes performed in the earlier iteration. The method also obtains the workers' action probability distributions at each state. Also, the worker skill levels are updated iteratively over the entire training period while training the agent. The proposed disclosure improves the performance of training the agent with every iteration.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
CLAIMS:
1. A processor (202) implemented method for estimating a consensus policy for imitation learning using non-expert human demonstrations, wherein the method comprises:
pre-processing, a plurality of tasks to obtain a training data implemented by the processor (202), wherein preprocessing comprises obtaining the data from human demonstration corresponding to the plurality of tasks via crowd sourcing;
initializing, a first set of consensus policy parameters for the training data corresponding to the task (204), wherein the first set of consensus policy parameters includes a worker annotation, a policy parameter, a worker parameter and a difficulty parameter; and
estimating, the first set of consensus policy parameters using an iterative process to estimate the consensus policy for training the agent (206), wherein the iteration process comprises,
estimating, the first set of consensus policy parameters based on the previous values and the probability for the set of possible actions over the observed states, the policy parameter, the worker annotation, the worker parameter and the difficulty parameter,
wherein, for the first iteration the previous values are the initialized values of the first set of consensus policy parameters;
computing, a consensus policy for training the agent in accordance with the first set of consensus policy parameters;
training, the agent using the consensus policy, a combination of reward signals from the environment and obtaining the distillation loss (LD) from the trained agent; and
re-estimating, the first set of consensus policy parameters, to obtain a second set of consensus policy parameters,
wherein, the second set of consensus policy parameters comprises a new estimated value for the policy parameter, the worker parameter and the difficulty parameter, based on the distillation loss (LD) and the weighted state of action probabilities of the worker parameter and the difficulty parameter.
2. The method as claimed in claim 1, wherein the distillation loss (LD) is computed using the consensus policy obtained from the agent and the policy parameter of the first set of consensus policy parameters.
3. The method as claimed in claim 1, wherein the new estimated value for the difficulty parameter of the second set of consensus policy parameters are obtained using the probabilities of the maximized arguments of the consensus policy, the worker annotations and the difficulty parameter value of the first set of consensus policy parameters and the new estimated value of the worker parameter.
4. The method as claimed in claim 1, wherein the new estimated value for the worker parameter of the second set of consensus policy parameters are obtained using the probabilities of the maximized arguments of the consensus policy, the worker annotations and the worker parameter value of the first set of consensus policy parameters and the new estimated value of the difficulty parameter.
5. The method as claimed in claim 1, wherein the new estimated value for the policy parameter of the second set of consensus policy parameters are computed using the maximized arguments of the loss obtained from the proximal policy optimization and the distillation loss of the policy parameter of the first set of consensus policy parameters.
6. A system (100), the system (100) comprising:
a memory (102) storing instructions;
one or more Input/Output (I/O) interfaces (106);
and one or more processors (104) coupled to the memory (102) via the one or more I/O interfaces (106), wherein the processor (104) is configured by the instructions to:
pre-process, a plurality of tasks to obtain a training data implemented by the processor(202), wherein preprocessing comprises obtaining the data from human demonstration corresponding to the plurality of tasks via crowd sourcing;
initialize, a first set of consensus policy parameters for the training data corresponding to the task (204), wherein the first set of consensus policy parameters includes a worker annotation, a policy parameter, a worker parameter and a difficulty parameter; and
estimate, the first set of consensus policy parameters using an iterative process to estimate the consensus policy for training the agent (206), wherein the iteration process comprises,
estimating, the first set of consensus policy parameters based on the previous values and the probability for the set of possible actions over the observed states, the policy parameter, the worker annotation, the worker parameter and the difficulty parameter,
wherein, for the first iteration the previous values are the initialized values of the first set of consensus policy parameters;
computing, a consensus policy for training the agent in accordance with the first set of consensus policy parameters;
training, the agent using the consensus policy, a combination of reward signals from the environment and obtaining the distillation loss (LD) from the trained agent; and
re-estimating, the first set of consensus policy parameters, to obtain a second set of consensus policy parameters,
wherein, the second set of consensus policy parameters comprises a new estimated value for the policy parameter, the worker parameter and the difficulty parameter, based on the distillation loss (LD) and the weighted state of action probabilities of the worker parameter and the difficulty parameter.
7. The system (100) as claimed in claim 6, wherein the distillation loss (LD) is computed using the consensus policy obtained from the agent and the policy parameter of the first set of consensus policy parameters.
8. The system (100) as claimed in claim 6, wherein the new estimated value for the difficulty parameter of the second set of consensus policy parameters are obtained using the probabilities of the maximized arguments of the consensus policy, the worker annotations and the difficulty parameter value of the first set of consensus policy parameters and the new estimated value of the worker parameter.
9. The system (100) as claimed in claim 6, wherein the new estimated value for the worker parameter of the second set of consensus policy parameters are obtained using the probabilities of the maximized arguments of the consensus policy, the worker annotations and the worker parameter value of the first set of consensus policy parameters and the new estimated value of the difficulty parameter.
10. The system (100) as claimed in claim 6, wherein the new estimated value for the policy parameter of the second set of consensus policy parameters are computed using the maximized arguments of the loss obtained from the proximal policy optimization and the distillation loss of the policy parameter of the first set of consensus policy parameters.