1 Introduction
Reinforcement learning (RL) is a general framework for developing artificial agents that learn to make complex decisions by interacting with their environment. In recent years, off-the-shelf RL algorithms have achieved state-of-the-art performance on simulated tasks such as Atari games (Mnih et al., 2015) and real-world applications (Gu et al., 2017). However, sample efficiency remains a genuine concern for RL in general, and for model-free RL in particular.
To address this concern, transfer learning tries to reduce the number of samples required to learn a new (target) task by reusing previously acquired knowledge or skills from other similar (source) tasks (Taylor and Stone, 2009). The knowledge transferred can be represented in many different ways, including low-level information such as sample data (Lazaric, 2008), or high-level knowledge such as value functions (Taylor et al., 2005), task features (Banerjee and Stone, 2007), or skills (Konidaris and Barto, 2007). A major stream of literature focuses on reusing policies (Brys et al., 2015; Parisotto et al., 2015; Barreto et al., 2017; Gupta et al., 2017) because it is intuitive, direct, and does not rely on value functions that can be difficult to transfer, or are not always available by design of the RL algorithm.
Early work on transfer learning in RL focused on the efficient transfer of knowledge between a single source and target task (Taylor et al., 2007). However, the knowledge transferred from multiple source tasks can be more effective (Fernández and Veloso, 2006; Comanici and Precup, 2010; Rosman et al., 2016). Despite recent developments in transfer learning theory, few existing approaches are able to reason about task similarity separately in different parts of the state space. Doing so could lead to significant improvements in knowledge transfer, since only the information from relevant regions of each source task is selected for transfer in a mix-and-match manner (Taylor and Stone, 2009).
In this paper, we assume the states, actions and rewards are identical between source and target tasks, but their dynamics can vary. Furthermore, the dynamics and optimal policies of the source tasks are estimated prior to transfer. Such formulations are often motivated by practical applications. In the field of maintenance, for instance, practitioners often rely on a digital reconstruction of the machine and its surroundings, called a digital twin, to assist maintenance on the physical asset (Lund et al., 2018). Here, different source tasks could represent models of optimal control under a wide range of conditions (exogenous events such as weather, or physical properties of the machine and other endogenous factors) corresponding to common state, action, and reward, but differing transition dynamics. Such simulations are routinely developed in other areas as well, such as drug discovery (Durrant and McCammon, 2011), robotics (Christiano et al., 2016) or manufacturing (Zhang et al., 2017).
To enable contextual policy transfer in RL, we introduce a novel Bayesian framework for autonomously identifying and combining promising sub-regions from multiple source tasks. This is done by placing state-dependent Dirichlet priors over source task models, and updating them using state trajectories sampled from the true target dynamics while learning the target policy. Specifically, posterior updates are informed by the likelihood of the observed transitions under the source task models. However, explicit knowledge of target dynamics is not necessary, making our approach model-free with respect to the target task. Furthermore, naive tabulation of state-dependent priors is intractable in large or continuous state space problems, so we parameterize them as deep neural networks. This architecture, inspired by the mixture of experts (Jacobs et al., 1991; Bishop, 1994), serves as a surrogate model that can inform the state-dependent contextual selection of source policies for locally exploring promising actions in each state.
Our approach has several key advantages over other existing methods. Firstly, Bayesian inference allows priors to be specified over source task models. Secondly, the mixture model network can benefit from advances in deep network architectures, such as CNNs (Krizhevsky et al., 2012). Finally, our approach separates reasoning about task similarity from policy learning, so that it can be easily combined with different forms of policy reuse (Fernández and Veloso, 2006; Brys et al., 2015) and is easy to interpret, as we demonstrate later in our experiments.
The main contributions of this paper are threefold:

We introduce a contextual mixture model to efficiently learn state-dependent posterior distributions over source task models;

We show how the trained mixture model can be incorporated into existing policy reuse methods, such as directed exploration (MAPSE) and reward shaping (MARS);

We demonstrate the effectiveness and generality of our approach by testing it on problems with discrete and continuous spaces, including physics simulations.
1.1 Related Work
Using state-dependent knowledge to contextually reuse multiple source policies is a relatively new topic in transfer learning. Rajendran et al. (2015) used a soft attention mechanism to learn state-dependent weightings over source tasks, and then transferred either policies or values. Li and Kudenko (2018) proposed Two-Level Q-learning, in which the agent learns to select the most trustworthy source task in each state in addition to the optimal action. The selection of source policies can be seen as an outer optimization problem. The Context-Aware Policy Reuse algorithm of Li et al. (2019) used options to represent selection of source policies as well as target actions, learning Q-values and termination conditions simultaneously. However, these two papers are limited to critic-based approaches with finite action spaces. To fill this gap, Kurenkov et al. (2019) proposed AC-Teach, which uses Bayesian DDPG to learn probability distributions over Q-values corresponding to student and teacher actions, and Thompson sampling for selecting exploratory actions from them. However, their inference technique is considerably different from ours, and is specific to the actor-critic setting. Our paper complements existing work by using source task dynamics rather than Q-values to reason about task similarity, and is compatible with both model-based and model-free RL.
Potential-based reward shaping (PBRS) was first introduced in Ng et al. (1999) for constructing dense reward signals without changing the optimal policies. Later, Wiewiora et al. (2003) and Devlin and Kudenko (2012) extended this to action-dependent and time-varying potential functions, respectively. More recently, Harutyunyan et al. (2015) combined these two extensions into one framework and used it to incorporate arbitrary reward functions into PBRS. Brys et al. (2015) made the connection between PBRS and policy reuse, by turning a single source policy into a binary reward signal and then applying Harutyunyan et al. (2015). Later, Suay et al. (2016) recovered a potential function from policy demonstrations directly using inverse RL. Our paper extends Brys et al. (2015) by reusing multiple source policies in a state-dependent way that is compatible with modern deep RL techniques. Thus, our paper advances the state-of-the-art in policy transfer and contributes to the expanding body of theoretical research into reward shaping.
2 Preliminaries
Markov Decision Process
We follow the framework of Markov decision processes (MDPs) (Puterman, 2014), defined as five-tuples $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$ where: $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $P(s'\,|\,s,a)$ are the state dynamics, $R(s,a,s')$ is a bounded reward function, and $\gamma \in [0,1)$ is a discount factor. In deterministic problems, the state dynamics are typically represented as a deterministic function $s' = f(s,a)$. The objective of an agent is to find an optimal deterministic policy $\pi^* : \mathcal{S} \to \mathcal{A}$ that maximizes the discounted cumulative reward $\mathbb{E}\big[\sum_{t=0}^{T} \gamma^t R(s_t, a_t, s_{t+1})\big]$ over the planning horizon $T$, where $a_t = \pi(s_t)$.
Reinforcement Learning
In the reinforcement learning setting, neither $P$ nor $R$ are assumed to be known by the agent a priori. Instead, an agent collects data by interacting with the environment through a randomized exploration policy $\pi(a\,|\,s)$, where $\pi(\cdot\,|\,s)$ denotes a probability distribution over $\mathcal{A}$. In model-based RL (MBRL), the agent uses this data to first learn $P$ and $R$, and then uses these to learn the optimal policy $\pi^*$. In model-free RL, an agent learns the optimal policy directly without knowledge of $P$ or $R$. Model-free RL algorithms typically fall into one of two categories. Temporal difference (TD) or Monte Carlo (MC) algorithms approximate the state-action value function $Q(s,a)$ using a tabular (Watkins and Dayan, 1992) or deep neural network representation (Mnih et al., 2015). Policy gradient methods, on the other hand, learn the optimal policy directly (Sutton et al., 2000).
Model Learning
In model-based RL, the dynamics model $P$ or $f$ is typically parameterized as a deep neural network and trained through repeated interaction with the environment. It typically returns an estimate of the next state directly, $\hat{s}' = \hat{f}(s,a)$, or approximates its distribution using, for instance, a Gaussian model $\hat{P}(s'\,|\,s,a) = \mathcal{N}(\mu(s,a), \Sigma(s,a))$. Subsequently, samples from the trained dynamics model can be used to augment the real experience when training the policy (Sutton, 1991; Peng et al., 2018; Kaiser et al., 2019), although other methods for reusing dynamics exist (Todorov and Li, 2005; Levine and Koltun, 2013; Heess et al., 2015; Nagabandi et al., 2018).
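As a concrete illustration of the model-learning step, the sketch below fits a deterministic dynamics model $\hat{f}(s,a)$ to transition data by minimizing the MSE. For simplicity it uses a linear least-squares model in place of the deep networks described above; the ground-truth matrices `A_true` and `B_true` are hypothetical toy values, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth deterministic dynamics: s' = A s + B a
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

def step(s, a):
    return A_true @ s + B_true @ a

# Collect transitions (s, a, s') from random interaction
S = rng.normal(size=(500, 2))
A_acts = rng.normal(size=(500, 1))
S_next = S @ A_true.T + A_acts @ B_true.T

# Fit f_hat(s, a) = [s; a] W by minimizing the MSE (least squares)
X = np.hstack([S, A_acts])                     # inputs [s, a]
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)

def f_hat(s, a):
    return np.concatenate([s, a]) @ W

s0, a0 = np.array([0.5, -0.2]), np.array([0.3])
err = np.linalg.norm(f_hat(s0, a0) - step(s0, a0))
```

Because the toy dynamics are exactly linear, the least-squares fit recovers them almost perfectly; a neural network trained on the same MSE objective plays the analogous role in the paper's experiments.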
Transfer Learning
We are interested in solving the following transfer learning problem. A library of $n$ source tasks, and a single target task, are provided with identical $\mathcal{S}$, $\mathcal{A}$ and $R$ (in practice, we only require that the goals of source and target tasks are shared), but different dynamics. Each source task $k$ is then solved to obtain estimates of the optimal policy $\pi_k$, as well as the dynamics $P_k$ or $f_k$. The main objective of this paper is to make use of this knowledge to solve the new target task in an efficient online manner.
3 ModelAware Policy Reuse
Assuming the dynamics models have been learned for all source tasks, we now proceed to model and learn state-dependent contextual similarity between source tasks and a target task.
3.1 Contextual Mixture Models
We first introduce a state-dependent prior $P(\mathbf{w}\,|\,s)$ over combinations $\mathbf{w}$ of source task models, that tries to match the true (unknown) target dynamics using transition data $\mathcal{D}_t$ collected from the target environment up to the current time $t$. Here, $\mathbf{w} = (w_1, \dots, w_n)$ consists of non-negative elements $w_k$ such that $\sum_{k=1}^n w_k = 1$. Using combinations to model uncertainty in source task selection can be viewed as Bayesian model combination, which allows inference over a general space of hypotheses and has been shown to exhibit stable convergence in practice (Minka, 2000; Monteith et al., 2011).
The motivation for learning a state-dependent prior is that the optimal behaviour in the target task may be locally similar to one source task in one region of the state space, but to a different source task in another region. By reasoning about task similarity locally in different areas of the state space, a reinforcement learning agent can make more efficient use of source task knowledge. Theoretically, better estimates of dynamics lead to better estimates of the value function, and hence of the optimal policy.
Theorem 1.
Consider an MDP with finite $\mathcal{S}$ and $\mathcal{A}$ and bounded reward $R$. Let $\mathbf{r}$ be the reward function in vector form, $\hat{\mathbf{P}}_\pi$ be an estimate of the transition probabilities induced by a policy $\pi$ in matrix form, and $\hat{\mathbf{v}}_\pi = (\mathbf{I} - \gamma \hat{\mathbf{P}}_\pi)^{-1} \mathbf{r}$ be the corresponding value function in vector form. Also, let $\mathbf{P}_\pi$ and $\mathbf{v}_\pi$ be the corresponding values under the true dynamics. Then for any policy $\pi$,
$$\|\hat{\mathbf{v}}_\pi - \mathbf{v}_\pi\|_\infty \leq \frac{\gamma}{(1-\gamma)^2}\, \|\mathbf{r}\|_\infty\, \|\hat{\mathbf{P}}_\pi - \mathbf{P}_\pi\|_\infty.$$
This result justifies our methodology of using source task dynamics similarity to guide state-dependent policy reuse from source tasks. A proof is provided in Appendix A.
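The theorem's bound can be checked numerically. The sketch below builds a random finite MDP under a fixed policy, computes the value vectors under true and perturbed dynamics, and verifies the inequality, assuming the standard simulation-lemma constant $\gamma / (1-\gamma)^2$ (the matrix infinity norm is the maximum absolute row sum).

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 5, 0.9

# Random true and estimated transition matrices induced by a fixed policy
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
P_hat = rng.random((n, n)); P_hat /= P_hat.sum(axis=1, keepdims=True)
r = rng.uniform(-1.0, 1.0, size=n)             # bounded reward vector

I = np.eye(n)
v = np.linalg.solve(I - gamma * P, r)          # true value function
v_hat = np.linalg.solve(I - gamma * P_hat, r)  # value under estimated dynamics

lhs = np.max(np.abs(v_hat - v))                # ||v_hat - v||_inf
rhs = (gamma / (1 - gamma) ** 2
       * np.max(np.abs(P_hat - P).sum(axis=1))  # ||P_hat - P||_inf (max row sum)
       * np.max(np.abs(r)))                     # ||r||_inf
```

The inequality `lhs <= rhs` holds for any pair of row-stochastic matrices, mirroring the argument in Appendix A.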
In this setting, exact inference over $\mathbf{w}$ is intractable, so we model the uncertainty in $\mathbf{w}$ using a surrogate probability distribution. In particular, since each realization of $\mathbf{w}$ is a discrete probability distribution, a suitable prior for $\mathbf{w}$ in each state $s$ is a Dirichlet distribution with density
$$p(\mathbf{w}\,|\,s) = \frac{\Gamma\big(\sum_{k=1}^n \alpha_k(s)\big)}{\prod_{k=1}^n \Gamma(\alpha_k(s))} \prod_{k=1}^n w_k^{\alpha_k(s)-1}, \quad (1)$$
where $\alpha_k$ are mappings from $\mathcal{S}$ to $(0, \infty)$.
Next, by averaging out the uncertainty in $\mathbf{w}$, we can obtain an a-posteriori estimator of the target dynamics:
$$\hat{P}(s'\,|\,s,a) = \mathbb{E}_{\mathbf{w} \sim p(\mathbf{w}|s)}\Big[\sum_{k=1}^n w_k\, P_k(s'\,|\,s,a)\Big] = \sum_{k=1}^n \mathbb{E}[w_k\,|\,s]\, P_k(s'\,|\,s,a). \quad (2)$$
In the following sections, we will instead refer to the following normalized form of (2):
$$\hat{P}(s'\,|\,s,a) = \sum_{k=1}^n \xi_k(s)\, P_k(s'\,|\,s,a), \quad (3)$$
where $\xi_k(s) = \alpha_k(s) / \sum_{j=1}^n \alpha_j(s)$ is the mean of a Dirichlet random variable with density (1) and $\sum_{k=1}^n \xi_k(s) = 1$. Therefore, the posterior estimate of target dynamics (3) can be represented as a mixture of source task models.
In a tabular setting, it is feasible to maintain separate estimates of $\xi(s)$ per state using Bayes' rule
$$\xi_k(s_t) \leftarrow \frac{\xi_k(s_t)\, P_k(s_{t+1}\,|\,s_t,a_t)}{\sum_{j=1}^n \xi_j(s_t)\, P_j(s_{t+1}\,|\,s_t,a_t)}, \quad (4)$$
computed using sampling (Andrieu et al., 2003) or variational inference (Gimelfarb et al., 2018). However, maintaining (4) for large or continuous state spaces presents inherent computational challenges. Furthermore, it is often not practical to cache $\mathcal{D}_t$, but rather to process each sample online or in batches.
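To make the tabular update concrete, the following sketch maintains a posterior over two hypothetical source models for a single state-action pair and applies Bayes' rule to each observed transition. The next-state distributions `P1` and `P2` are illustrative values, not taken from the paper; the target dynamics coincide with source 1, so the posterior should concentrate there.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two hypothetical tabular source models over 3 next-states for one (s, a)
P1 = np.array([0.8, 0.1, 0.1])   # source task 1
P2 = np.array([0.1, 0.1, 0.8])   # source task 2
models = np.stack([P1, P2])

xi = np.array([0.5, 0.5])        # uniform prior over source models

# Target dynamics coincide with source 1; sample transitions and update
for _ in range(50):
    s_next = rng.choice(3, p=P1)          # observed next state
    likelihood = models[:, s_next]        # P_k(s' | s, a) for each source
    xi = xi * likelihood
    xi /= xi.sum()                        # Bayes' rule, as in (4)
```

After a few dozen samples the posterior mass on the correct source model is close to 1, illustrating why the update is informative even though the target dynamics are never modeled explicitly.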
3.2 Deep Contextual Mixture Models
Fortunately, as (3) shows, the posterior mean $\xi(s)$ is a sufficient estimator of $\hat{P}$. Therefore, we can approximate $\xi(s)$ directly using a feedforward neural network $\xi(s;\theta)$ with parameters $\theta$ that can then be optimized using gradient descent. Now it is no longer necessary to store all of $\mathcal{D}_t$, since each sample can be processed online or in batches. Furthermore, since $\xi(s;\theta)$ approximates $\xi(s)$ and fully parameterizes the estimated model (3), we can write $\hat{P}(s'\,|\,s,a;\theta)$. The input of the network is a vectorized state representation of $s$, and the outputs are fed through the softmax function to guarantee that $\xi_k(s;\theta) \geq 0$ and $\sum_{k=1}^n \xi_k(s;\theta) = 1$.
In order to learn the parameters $\theta$, we can minimize the empirical negative log-likelihood function (note that this equates to maximizing the posterior with a uniform prior over $\theta$) using gradient descent, given by (3) as:
$$\mathcal{L}(\theta) = -\sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}_t} \log \sum_{k=1}^n \xi_k(s_t;\theta)\, P_k(s_{t+1}\,|\,s_t,a_t). \quad (5)$$
The gradient of $\mathcal{L}(\theta)$ has a Bayesian interpretation. For one observation $(s, a, s')$ and a final softmax layer that is linear in state features $\phi(s)$, it can be written as
$$\nabla_{\theta_k} \ell(\theta) = -\big(\xi'_k(s) - \xi_k(s;\theta)\big)\, \phi(s), \quad (6)$$
where
$$\xi'_k(s) = \frac{\xi_k(s;\theta)\, P_k(s'\,|\,s,a)}{\sum_{j=1}^n \xi_j(s;\theta)\, P_j(s'\,|\,s,a)}. \quad (7)$$
Here, we can interpret $\xi(s;\theta)$ as a prior. Once a new sample is observed, we compute the posterior $\xi'(s)$ using Bayes' rule (7), and $\theta$ is updated according to the difference between prior and posterior, scaled by the state features $\phi(s)$. Hence, gradient updates in $\theta$-space can be viewed as projections of posterior updates in $\xi$-space, and the geometry of this learning process is illustrated in Figure 1. Regularization of $\theta$ can be incorporated naturally by introducing an informative prior (e.g. isotropic Gaussian, Laplace) in the loss (5), and can lead to smoother posteriors.
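The prior-to-posterior view of the gradient can be illustrated with a minimal linear-softmax mixture. The sketch below uses hypothetical features and fixed likelihoods (not the paper's architecture): each step computes the Bayes posterior and moves the logits by the prior-posterior difference scaled by the features, so the mixture weight of the better-fitting source model grows.

```python
import numpy as np

phi = np.array([1.0, 0.5])       # state features phi(s), illustrative
theta = np.zeros((2, 2))         # one logit weight vector per source model

def mixture(theta, phi):
    """Softmax mixture weights xi(s; theta) for a linear final layer."""
    logits = theta @ phi
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Fixed likelihoods of one observed transition under each source model
lik = np.array([0.8, 0.2])       # source 1 explains the sample better

for _ in range(100):
    prior = mixture(theta, phi)
    posterior = prior * lik / (prior * lik).sum()   # Bayes' rule (7)
    # Gradient step (6): logits move by (posterior - prior), scaled by phi
    theta += 0.5 * np.outer(posterior - prior, phi)

final_weight = mixture(theta, phi)[0]
```

Repeating the update drives the prior toward the posterior, so `final_weight` approaches 1 for the source model with higher likelihood, which is exactly the projection-of-posterior-updates picture described above.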
3.3 Conditional RBF Networks
In continuous-state tasks with deterministic transitions, the $P_k$ correspond to Dirac measures, in which case we only have access to models $f_k$ that predict the next state. In order to tractably update the mixture model in this setting, we assume that, given source task $k$ is the correct model of target dynamics, the probability of observing a transition from state $s$ to state $s'$ is a decreasing function of the prediction error $\|s' - f_k(s,a)\|$. More formally, given an arbitrarily small region $B(s')$ around $s'$,
$$P_k\big(B(s')\,|\,s,a\big) \propto \kappa\big(s', f_k(s,a)\big), \quad (8)$$
where $\kappa$ can be interpreted as a normalized (technically, we only require $\kappa \geq 0$, as the likelihood function need not be a valid probability density) radial basis function. A popular choice of $\kappa$, implemented in this paper, is the Gaussian kernel, which for precision $\tau > 0$ is
$$\kappa(s', \hat{s}') = \Big(\frac{\tau}{2\pi}\Big)^{d/2} \exp\Big(-\frac{\tau}{2}\, \|s' - \hat{s}'\|_2^2\Big). \quad (9)$$
In principle, $\tau$ could be modeled as an additional output of the mixture model and learned from data (Bishop, 1994), although we treat it as a constant in this paper.
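A minimal implementation of the Gaussian kernel likelihood is shown below, assuming the standard normalized form with precision `tau` (the value 4.0 is an arbitrary illustrative constant, not a setting from the paper).

```python
import numpy as np

def gaussian_kernel(s_next, s_pred, tau=4.0):
    """Gaussian RBF likelihood of observing s_next given prediction s_pred, as in (9)."""
    d = s_next.size
    sq_err = np.sum((s_next - s_pred) ** 2)
    return (tau / (2.0 * np.pi)) ** (d / 2.0) * np.exp(-0.5 * tau * sq_err)

s_next = np.array([0.0, 0.0])
near = gaussian_kernel(s_next, np.array([0.1, 0.0]))   # accurate prediction
far = gaussian_kernel(s_next, np.array([1.0, 0.0]))    # inaccurate prediction
```

As required by assumption (8), the likelihood decreases monotonically with the prediction error, so source models that predict the observed transition well receive more posterior mass.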
By using (8) and following the derivations leading to (2), we obtain the following result in direct analogy to (3)
(10) 
Consequently, the results derived in the previous sections, including the mixture model and loss function for
, hold by replacing with . Furthermore, since (10) approximates the target dynamics as a mixture of kernel functions, it can be viewed as a conditional analogue of the RBF network (Broomhead and Lowe, 1988). It remains to show how to make use of this model and the source policy library to solve a new target task.3.4 Policy Reuse
The most straightforward approach is to sample a source policy $\pi_k$ according to $\xi(s;\theta)$ and follow its recommended action in state $s$. To allow for random exploration, the agent is only allowed to follow this action with probability $p_t$, initially set to a high value and annealed over time to maximize the use of source policies early in training (Fernández and Veloso, 2006; Li and Zhang, 2018). The resulting behaviour policy is suitable for any off-policy RL algorithm. We call this algorithm Model-Aware Policy ReuSe for Exploration (MAPSE), and present the pseudocode in Algorithm 1.
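The MAPSE action-selection step described above can be sketched as follows. The toy policies, mixture weights, and action set are hypothetical, and `p_t` plays the role of the annealed probability of following a source policy.

```python
import numpy as np

rng = np.random.default_rng(4)

def mapse_action(s, xi_s, source_policies, actions, p_t):
    """With prob. p_t, follow a source policy sampled from xi(s); else explore uniformly."""
    if rng.random() < p_t:
        k = rng.choice(len(source_policies), p=xi_s)  # sample a source task
        return source_policies[k](s)                  # follow its recommended action
    return rng.choice(actions)                        # uniform random exploration

# Two hypothetical deterministic source policies on a toy discrete task
policies = [lambda s: 0, lambda s: 1]
xi_s = np.array([0.9, 0.1])      # mixture strongly favours source 1 in this state

acts = [mapse_action(None, xi_s, policies, [0, 1, 2], p_t=1.0) for _ in range(1000)]
frac_source1 = np.mean(np.array(acts) == 0)
```

With `p_t = 1.0` the agent follows the first source policy roughly 90% of the time, matching the mixture weight; annealing `p_t` toward 0 gradually hands control back to the learned target policy.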
However, such an approach has several shortcomings. Firstly, it is not clear how to anneal $p_t$, since the underlying mixture $\xi(s;\theta)$ is learned over time and is non-stationary. Secondly, following the recommended actions too often can lead to poor test performance, since the agent may not observe suboptimal actions enough times to learn to avoid them at test time. Finally, since efficient credit assignment is particularly difficult in sparse reward problems (Seo et al., 2019), it may limit the effectiveness of action recommendation.
Instead, motivated by the recent success of dynamic reward shaping (Brys et al., 2015), we directly modify the original reward to $R(s,a,s') + F_t(s,a,s',a')$, where:
$$F_t(s,a,s',a') = \gamma\, \Phi_{t+1}(s',a') - \Phi_t(s,a), \qquad \Phi_t(s,a) = -C\, \rho_t(a\,|\,s). \quad (11)$$
Here, $C$ is a positive constant that defines the strength of the shaped reward signal and can be tuned for each problem, and $\rho_t(a\,|\,s)$ is chosen to be the posterior probability that action $a$ would be recommended by a source policy in state $s$ at time $t$. The sign of the potential function is negative to encourage the agent to take the recommended actions, following the argument in Harutyunyan et al. (2015). A more elaborate approach would update $\Phi$ using an on-policy value-based RL algorithm, as explained in the aforementioned paper. By repeating the derivations leading to (2), we can derive a similar expression for $\rho_t$ as follows:
$$\rho_t(a\,|\,s) = \sum_{k=1}^n \xi_k(s;\theta_t)\, \pi_k(a\,|\,s). \quad (12)$$
Note that (12) reduces to Brys et al. (2015) when $n = 1$ and source policies are deterministic, e.g. $\pi_k(a\,|\,s) = \mathbb{1}\{a = \pi_k(s)\}$. Unlike MAPSE, this approach can also be applied on-policy, and is guaranteed to converge (Devlin and Kudenko, 2012). We call this approach Model-Aware Reward Shaping (MARS), and present the training procedure of both MARS and MAPSE in Algorithm 2.
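The MARS shaped reward can be sketched as below, assuming the potential takes the form $\Phi_t(s,a) = -C\,\rho_t(a|s)$ with $\rho_t$ the mixture over source policies; the policies, mixture weights, and constants are illustrative, not the paper's settings.

```python
C, gamma = 1.0, 0.99

def rho(a, s, xi_s, source_policies):
    """Posterior probability (12) that action a is recommended in state s."""
    return sum(xi_k * pi(a, s) for xi_k, pi in zip(xi_s, source_policies))

def shaped_reward(r, s, a, s2, a2, xi_s, xi_s2, source_policies):
    """MARS reward (11): r + gamma * Phi(s', a') - Phi(s, a), with Phi = -C * rho."""
    phi = -C * rho(a, s, xi_s, source_policies)
    phi2 = -C * rho(a2, s2, xi_s2, source_policies)
    return r + gamma * phi2 - phi

# Two hypothetical source policies over actions {0, 1}: one deterministic, one uniform
policies = [lambda a, s: 1.0 if a == 0 else 0.0,
            lambda a, s: 0.5]
xi_s = [0.8, 0.2]                # mixture favours the deterministic source

# Shaped reward along a transition that follows recommended vs. unrecommended actions
good = shaped_reward(0.0, None, 0, None, 0, xi_s, xi_s, policies)
bad = shaped_reward(0.0, None, 1, None, 1, xi_s, xi_s, policies)
```

Here `good > bad`: because the potential is negative, the `-Phi(s, a)` term pays a bonus proportional to how strongly the mixture recommends the taken action, which is how the shaping steers exploration toward source-recommended behaviour without changing the optimal policy.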
The proposed framework is general and modular, since it can be combined with any standard RL algorithm in model-free and model-based settings. Furthermore, the computational cost of processing each sample is a linear function of the cost of evaluating the source task dynamics and policies, and can be implemented efficiently using neural networks. In a practical implementation, the mixture model would be trained with a higher learning rate or larger batch size than the target policy, to make better use of source task information earlier.
Many extensions to the current framework are possible. For instance, to reduce the effect of incomplete or negative transfer, it is possible to estimate the target task dynamics directly, and include the estimate as an additional $(n+1)$-st component in the mixture (3). If this model can be estimated accurately, it can also be used to update the agent directly, as suggested in Section 2. Further improvements for MARS could be obtained by learning a secondary Q-value function for the potentials, following Harutyunyan et al. (2015). We do not investigate these extensions in this paper; they can form interesting topics for future study.
4 Empirical Evaluation
In this section, we evaluate, empirically, the performance of both MAPSE (Algorithm 1) and MARS (11)-(12) in a typical RL setting (Algorithm 2). In particular, we are interested in answering the following research questions:

Does the mixture model learn to select the most relevant source task(s) in each state?

Does MARS (and possibly MAPSE) achieve better test performance, relative to the number of transitions observed, over existing baselines?

Does MARS lead to better transfer than MAPSE?
In order to answer these questions, we implement Algorithm 2 in the context of tabular Q-learning (Watkins and Dayan, 1992) and DQN (Mnih et al., 2015) with MAPSE and MARS. To ensure fair comparison, we include the following state-of-the-art context-free and contextual policy reuse baselines:

CAPS: a contextual option-based algorithm recently proposed in Li et al. (2019);

UCB: a context-free UCB algorithm recently proposed in Li and Zhang (2018); here, performance depends on the annealing schedule for following source policies, which is a function of the episode number, so we run a grid search over its parameter and report the best case (likewise for MAPSE);

$\pi_1, \pi_2, \dots$: PBRS using a binary reward derived from each source policy in isolation, as suggested in Brys et al. (2015);

Q: Q-learning or DQN with no transfer.
To help us answer the research questions above, we consider three variants of existing problems, TransferMaze, TransferCartPole, and TransferSparseLunarLander, that are explained in the subsequent subsections. All experiments are run using Keras with TensorFlow backend. Full details are provided in Appendix B.
4.1 TransferMaze
The first experiment consists of a 30-by-30 discrete maze environment with four sub-rooms as illustrated in Figure 2. The four possible actions left, up, right, down move the agent one cell in the corresponding direction, but have no effect if the destination contains a wall. The agent incurs a penalty of 0.02 for hitting a wall, and otherwise 0.01. Upon reaching the goal, the agent receives +1.0 and the game ends. The goal is to find the shortest path from the green cell to the red cell in the target maze shown in Figure 2. The source tasks, as shown in Figure 2, each correctly model the interior of one room. As a result, only a context-aware algorithm can learn to utilize the source task knowledge correctly.
For each source task, we use Q-learning to learn optimal policies $\pi_k$. The dynamics $P_k$ are maintained as lookup tables and are hence exact. The target policies are learned using Q-learning with each baseline. The experiment is repeated 20 times and the aggregated results are reported in Figure 2(a). Figure 2(b) plots the state-dependent posterior learned over time on a single trial.
Figure 2(a): smoothed mean test performance (number of steps needed to reach the goal) and standard error over 20 trials on the TransferMaze domain solved with Q-learning. Transferring from any single source policy in isolation prevents the agent from converging in the specified number of samples.
4.2 TransferCartPole
We next consider a variation of the continuous-state CartPole control problem, where the force applied to the cart is not constant, but varies with the cart position $x$. One way to interpret this is that the surface is not frictionless, but contains slippery and rough patches. To learn better policies, the agent can apply half or full force to the cart in either direction (4 possible actions). As a result, the optimal policy in each state depends on the surface. The problem is made more difficult by initializing the cart position uniformly at random, to require the agent to generalize control to both surfaces. In the first two source tasks, agents balance the pole only on rough and slippery surfaces, respectively. In the third source task, the pole length is doubled.
Following Mnih et al. (2015), Q-values are approximated using feedforward neural networks, and we use randomized experience replay and a target Q-network with hard updates. State dynamics are parameterized as feedforward neural networks and trained in a supervised way with the MSE loss on batches drawn randomly from the buffer. To learn the mixture, the likelihood is estimated using (9) with fixed $\tau$. For CAPS, we follow Li et al. (2019) and only train the last layer when learning termination functions. We tried different learning rates and picked the best one. The test performance is illustrated in Figure 3(a). Figure 3(b) plots the state-dependent posterior learned over time.


4.3 TransferSparseLunarLander
The final experiment consists of a variation of the LunarLander-v2 domain from OpenAI Gym with sparse reward, in which the reward signal is deferred until the end of each episode. This is a high-dimensional continuous stochastic problem with sparse reward, representative of many real-world problems where it is considerably harder to learn correct dynamics, and hence to transfer skills effectively. The first source task teaches the lander to hover above the landing pad at a fixed region in space, and fails if the lander gets too close to the ground. The second source task places the rover at a random location above the landing pad, and the agent learns to land the craft safely. The third source task is equivalent to LunarLander-v2, except the mass of the craft is halved. A successful transfer experiment, therefore, should learn to transfer skills from the hover and land source tasks depending on altitude.
To solve this problem, we use the same setup as in TransferCartPole. Here, state transitions are stochastic and the moon surface is generated randomly in each episode, so dynamics are learned on noisy data. Observed state components are clipped to a fixed range to reduce the effect of outliers, and tanh output activations predict the first 6 state components (position and velocity) while sigmoid activations predict the last two (leg contact with the ground). Furthermore, the dynamics are learned offline on data collected during policy training, to avoid the moving target problem and improve learning stability. The MSE obtained for the hover dynamics is considerably lower than for the other source task dynamics, highlighting the difficulty of correctly learning accurate dynamics for ground contact. The test performance averaged over 10 trials is shown in Figure 5. Figure 6 illustrates the output of the mixture on 10 state trajectories obtained during training.
4.4 Discussion
MARS consistently outperforms all baselines in terms of sample efficiency and solution quality, and MAPSE outperforms UCB, as shown in Figures 2(a), 3(a) and 5. Figures 2(b), 3(b) and 6 provide one possible explanation for this, namely the ability of the mixture model to converge to good mixtures even when presented with imperfect source dynamics, as in TransferSparseLunarLander. Furthermore, on all three domains, MARS achieves asymptotic performance comparable to, or better than, the best single potential function. Interestingly, although MARS consistently outperforms CAPS, MAPSE only does so on TransferSparseLunarLander. This reaffirms our hypothesis in Section 3.4 that reward shaping can improve generalization on test data with little tuning. Furthermore, we conjecture that the inconsistent performance of CAPS is due to its reliance on fluctuating Q-values, which is mitigated in MARS and MAPSE by their reliance instead on more stable samples of the dynamics.
5 Conclusion
We investigated transfer of policies from multiple source tasks with identical goals but different dynamics. We showed, theoretically, how dynamics are related to policy values. We then used estimates of source task dynamics to contextually measure similarity between source and target tasks using a deep mixture model. We introduced two ways to use this information to improve training in the target task. Experiments showed strong performance and the advantages of leveraging more stable dynamics, as well as reward shaping, as a means of contextual transfer. Several possible extensions of this work were discussed in Section 3.4. It is also possible to generalize this work to MDPs with different goals (Schaul et al., 2015) or different state or action spaces (Taylor et al., 2007).
References

An introduction to MCMC for machine learning. Machine Learning 50 (1-2), pp. 5–43. Cited by: §3.1.
 General game learning using knowledge transfer. In IJCAI, pp. 672–677. Cited by: §1.
 Successor features for transfer in reinforcement learning. In NIPS, pp. 4055–4065. Cited by: §1.
 Mixture density networks. Technical report Citeseer. Cited by: §1, §3.3.

Radial basis functions, multivariable functional interpolation and adaptive networks. Technical report, Royal Signals and Radar Establishment Malvern (United Kingdom). Cited by: §3.3.
 Policy transfer using reward shaping. In AAMAS, pp. 181–188. Cited by: §1.1, §1, §1, §3.4, §3.4, item 3.
 Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518. Cited by: §1.
 Optimal policy switching algorithms for reinforcement learning. In AAMAS, pp. 709–714. Cited by: §1.
 Dynamic potential-based reward shaping. In AAMAS, pp. 433–440. Cited by: §1.1, §3.4.
 Molecular dynamics simulations and drug discovery. BMC biology 9 (1), pp. 71. Cited by: §1.
 Probabilistic policy reuse in a reinforcement learning agent. In AAMAS, pp. 720–727. Cited by: §1, §1, §3.4.
 Reinforcement learning with multiple experts: a bayesian model combination approach. In NIPS, pp. 9528–9538. Cited by: §3.1.
 Deep reinforcement learning for robotic manipulation with asynchronous offpolicy updates. In ICRA, pp. 3389–3396. Cited by: §1.
 Learning invariant feature spaces to transfer skills with reinforcement learning. In ICLR, Cited by: §1.
 Expressing arbitrary reward functions as potentialbased advice. In AAAI, Cited by: §1.1, §3.4, §3.4.
 Learning continuous control policies by stochastic value gradients. In NIPS, pp. 2944–2952. Cited by: §2.
 Adaptive mixtures of local experts. Neural computation 3 (1), pp. 79–87. Cited by: §1.
 Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374. Cited by: §2.
 Building portable options: skill transfer in reinforcement learning. In IJCAI, Vol. 7, pp. 895–900. Cited by: §1.
 Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105. Cited by: §1.
 AC-Teach: a Bayesian actor-critic method for policy learning with an ensemble of suboptimal teachers. In ICRA. Cited by: §1.1.
 Knowledge transfer in reinforcement learning. Ph.D. Thesis, Politecnico di Milano. Cited by: §1.
 Guided policy search. In ICML, pp. 1–9. Cited by: §2.
 Reinforcement learning from multiple experts demonstrations. In ALA, Vol. 18. Cited by: §1.1.
 Contextaware policy reuse. In AAMAS, pp. 989–997. Cited by: §1.1, item 1, §4.2.
 An optimal online method of selecting source policies for reinforcement learning. In AAAI, Cited by: §3.4, item 2.
 Digital twin interface for operating wind farms. Google Patents. Note: US Patent 9,995,278 Cited by: §1.
 Bayesian model averaging is not model combination. Available electronically at http://www.stat.cmu.edu/minka/papers/bma.html, pp. 1–2. Cited by: §3.1.
 Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1, §2, §4.2, §4.
 Turning bayesian model averaging into bayesian model combination. In IJCNN, pp. 2657–2663. Cited by: §3.1.
 Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA, pp. 7559–7566. Cited by: §2.
 Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287. Cited by: §1.1.
 Actor-mimic: deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342. Cited by: §1.
 Deep Dyna-Q: integrating planning for task-completion dialogue policy learning. In ACL, pp. 2182–2192. Cited by: §2.
 Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Cited by: §2.
 Attend, adapt and transfer: attentive deep architecture for adaptive transfer from multiple sources in the same domain. arXiv preprint arXiv:1510.02879. Cited by: §1.1.
 Bayesian policy reuse. Machine Learning 104 (1), pp. 99–127. Cited by: §1.
 Universal value function approximators. In ICML, pp. 1312–1320. Cited by: §5.
 Rewards prediction-based credit assignment for reinforcement learning with sparse binary rewards. IEEE Access 7, pp. 118776–118791. External Links: Document, ISSN 2169-3536. Cited by: §3.4.
 Learning from demonstration for shaping through inverse reinforcement learning. In AAMAS, pp. 429–437. Cited by: §1.1.
 Policy gradient methods for reinforcement learning with function approximation. In NIPS, pp. 1057–1063. Cited by: §2.
 Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin 2 (4), pp. 160–163. Cited by: §2.
 Value functions for RL-based behavior transfer: a comparative study. In AAAI, Vol. 5, pp. 880–885. Cited by: §1.
 Transfer learning via intertask mappings for temporal difference learning. JMLR 8 (Sep), pp. 2125–2167. Cited by: §1, §5.
 Transfer learning for reinforcement learning domains: a survey. JMLR 10 (Jul), pp. 1633–1685. Cited by: §1, §1.
 A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005 American Control Conference, pp. 300–306. Cited by: §2.
 Q-learning. Machine Learning 8 (3-4), pp. 279–292. Cited by: §2, §4.
 Principled methods for advising reinforcement learning agents. In ICML, pp. 792–799. Cited by: §1.1.
 A digital twin-based approach for designing and multi-objective optimization of hollow glass production line. IEEE Access 5, pp. 26901–26911. Cited by: §1.
Appendix
A. Proof of Theorem 1
First, observe that for any stochastic matrix $\mathbf{P}$, $\|\mathbf{P}\mathbf{v}\|_\infty \leq \|\mathbf{v}\|_\infty$, where $\|\cdot\|_\infty$ is the infinity norm, and $\mathbf{I} - \gamma\mathbf{P}$ is always invertible, since the eigenvalues of $\gamma\mathbf{P}$ always lie in the interior of the unit circle for $\gamma < 1$. Therefore, $\|(\mathbf{I} - \gamma\mathbf{P})^{-1}\|_\infty \leq \sum_{t=0}^\infty \gamma^t \|\mathbf{P}\|_\infty^t = \frac{1}{1-\gamma}$. To simplify notation, we write $\hat{\mathbf{P}} = \hat{\mathbf{P}}_\pi$ and $\mathbf{P} = \mathbf{P}_\pi$. Then $\hat{\mathbf{v}}_\pi = (\mathbf{I} - \gamma\hat{\mathbf{P}})^{-1}\mathbf{r}$ and $\mathbf{v}_\pi = (\mathbf{I} - \gamma\mathbf{P})^{-1}\mathbf{r}$. Now, making use of the identity $\mathbf{A}^{-1} - \mathbf{B}^{-1} = \mathbf{A}^{-1}(\mathbf{B} - \mathbf{A})\mathbf{B}^{-1}$, we have
$$\hat{\mathbf{v}}_\pi - \mathbf{v}_\pi = (\mathbf{I} - \gamma\hat{\mathbf{P}})^{-1}\, \gamma\,(\hat{\mathbf{P}} - \mathbf{P})\, (\mathbf{I} - \gamma\mathbf{P})^{-1}\mathbf{r},$$
so that
$$\|\hat{\mathbf{v}}_\pi - \mathbf{v}_\pi\|_\infty \leq \frac{1}{1-\gamma}\cdot \gamma\, \|\hat{\mathbf{P}} - \mathbf{P}\|_\infty \cdot \frac{\|\mathbf{r}\|_\infty}{1-\gamma} = \frac{\gamma}{(1-\gamma)^2}\, \|\mathbf{r}\|_\infty\, \|\hat{\mathbf{P}} - \mathbf{P}\|_\infty,$$
and so the proof is complete.
B. Experiment Settings
All code was written and executed using Eclipse PyDev running Python 3.7. All neural networks were initialized and trained using Keras with TensorFlow backend (version 1.14), and weights were initialized using the default setting. The Adam optimizer was used to train all neural networks. Experiments were run on an Intel 6700HQ QuadCore processor with 8 GB RAM running on the Windows 10 operating system.
We used the following hyperparameters in the experiments:
Parameters  

Name  Description  TransferMaze  TransferCartPole  TransferSparseLunarLander 
maximum rollout length  300  500  1000  
discount factor  0.95  0.98  0.99  
exploration probability  0.12  
learning rate of Q-learning*  0.2 (MARS/PBRS), 0.8 (other)  
replay buffer capacity  5000  20000  
batch size  32  64  
topology of DQN  440404  81201004  
hidden activation of DQN  ReLU  ReLU  
learning rate of DQN  0.0005  0.001  
learning rate for termination function weights**  0.4  0.01  0.0001  
target network update frequency (in batches)  500  100  
L2 penalty of DQN  
topology of dynamics model  (4+4)50504  (8+4)100100(6+2)  
hidden activation of dynamics model  ReLU  ReLU  
learning rate of dynamics model  0.001  0.001  
L2 penalty of dynamics model  
Gaussian kernel precision  
topology of mixture model  5830304  430303  830303  
hidden activation of mixture  ReLU  ReLU  ReLU  
learning rate of mixture  0.001  0.001  0.001  
training epochs/batch for mixture 
4  3  1  
PBRS scaling factor  0.1  1.0  20.0  
probability of following source policies***  (MAPSE), (UCB)  (MAPSE), (UCB)  (MAPSE), (UCB) 
* we had to decrease the learning rate for MARS and reward shaping to avoid instability in the learning process due to the larger shaped reward
** we report the best value found in the deep learning case
*** the schedule is a function of the episode number; we report the best value found