Title: Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor
Authors: Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine
Link: https://proceedings.mlr.press/v80/haarnoja18b/haarnoja18b.pdf
What
Off-policy actor-critic deep RL algorithm with a stochastic actor for continuous state and action spaces that maximizes the expected reward plus the expected entropy of the target policy.
Why
Existing model-free deep RL algorithms have very high sample complexity and sensitive convergence properties, which makes extensive hyper-parameter tuning necessary for new domains.
How
TL;DR: Soft Actor-Critic (SAC) is an off-policy algorithm for continuous state and action spaces that optimizes a stochastic policy, concurrently learning the policy and two Q-functions, and it maximizes the entropy of the policy alongside the reward to encourage exploration.
SAC aims to solve the problem of maximizing the cumulative expected reward plus the entropy of the policy. This problem is sometimes referred to as Entropy-Regularized[^1] Reinforcement Learning (RL), and it is different from the standard RL problem of maximizing only the cumulative expected reward.
The new RL objective is:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(S_t, A_t) \sim \rho_\pi} \Big[ r(S_t, A_t) + \alpha \, \mathcal{H}\big( \pi(\cdot \mid S_t) \big) \Big] \tag{1}$$

where

- $\mathcal{H}\big( \pi(\cdot \mid S_t) \big) \stackrel{.}{=} \mathbb{E}_{\pi}\Big[ -\log \big( \pi(\cdot \mid S_t) \big) \Big]$ is the entropy of the policy $\pi$,
- $\alpha$ is the entropy coefficient[^2] that weighs the entropy term against the reward[^3],
- $\rho_\pi$ denotes the state-action marginal of the trajectory distribution induced by $\pi$.
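To make the entropy term concrete, here is a small sketch (not from the paper) that estimates $\mathcal{H}\big(\pi(\cdot \mid s)\big)$ for a diagonal Gaussian policy from samples; the batch size, action dimension, and standard deviation are made up for illustration.

```python
import torch
from torch.distributions import Normal

# Hypothetical policy output for a batch of states: mean and std of a diagonal Gaussian.
mean = torch.zeros(32, 6)        # 32 states, 6-dimensional actions
std = 0.5 * torch.ones(32, 6)

dist = Normal(mean, std)
actions = dist.sample()

# Monte Carlo estimate of H(pi(.|s)) = E_pi[-log pi(.|s)], averaged over the batch.
log_prob = dist.log_prob(actions).sum(dim=-1)   # sum over action dimensions
entropy_estimate = (-log_prob).mean()

# For a diagonal Gaussian the entropy is also available in closed form.
entropy_closed_form = dist.entropy().sum(dim=-1).mean()
print(entropy_estimate.item(), entropy_closed_form.item())
```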
Soft Approximate Policy Iteration (s-API)
Instead of running policy iteration with policy evaluation and policy improvement to convergence, the two steps are each approximated with some number of gradient descent steps.
The state-value function, the two action-value functions, and the policy are parametrized using function approximation as $\hat{v}(s \mid \mathbf{w})$, $\hat{q}_1(s, a \mid \boldsymbol{\theta}_{q_1})$ and $\hat{q}_2(s, a \mid \boldsymbol{\theta}_{q_2})$, and $\pi(a \mid s, \boldsymbol{\theta}_{\pi})$, respectively.
Policy Evaluation
In policy evaluation, the state-value and action-value functions are updated.
State-value function

The state-value function is trained to minimize the squared residual error

$$J_{\hat{v}}(\mathbf{w}) = \mathbb{E}_{S_t \sim \mathcal{D}} \bigg[ \frac{1}{2} \Big( \hat{v}(S_t \mid \mathbf{w}) - \mathbb{E}_{A_t \sim \pi} \big[ \hat{q}(S_t, A_t) - \log \pi(A_t \mid S_t, \boldsymbol{\theta}_{\pi}) \big] \Big)^{2} \bigg] \tag{2}$$

where $\mathcal{D}$ is the replay buffer, $\hat{q}(s, a) \stackrel{.}{=} \min \big( \hat{q}_1(s, a \mid \boldsymbol{\theta}_{q_1}), \hat{q}_2(s, a \mid \boldsymbol{\theta}_{q_2}) \big)$ is the minimum of the two action-value estimates, and the actions are sampled from the current policy rather than from the replay buffer.

The gradient of Equation 2 is:

$$\nabla_{\mathbf{w}} J_{\hat{v}}(\mathbf{w}) = \nabla_{\mathbf{w}} \hat{v}(S_t \mid \mathbf{w}) \Big( \hat{v}(S_t \mid \mathbf{w}) - \hat{q}(S_t, A_t) + \log \pi(A_t \mid S_t, \boldsymbol{\theta}_{\pi}) \Big) \tag{3}$$
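As an illustration, here is a minimal PyTorch-style sketch of the state-value update, assuming hypothetical modules `v_net`, `q1_net`, `q2_net`, and a `policy.sample` helper that returns actions and their log-probabilities; automatic differentiation stands in for the hand-written gradient of Equation 3.

```python
import torch

def v_loss(v_net, q1_net, q2_net, policy, states):
    """Soft state-value loss (Equation 2); gradients come from autograd rather than Equation 3."""
    # Fresh actions from the current policy with their log-probabilities
    # (policy.sample is an assumed helper returning (actions, log_pi)).
    actions, log_pi = policy.sample(states)
    q_min = torch.min(q1_net(states, actions), q2_net(states, actions))
    target = (q_min - log_pi).detach()          # no gradient through the target
    return 0.5 * (v_net(states) - target).pow(2).mean()
```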
Action-value function

The soft action-value functions are trained against the soft TD target

$$\hat{q}_{\text{target}}(S_t, A_t) \stackrel{.}{=} r(S_t, A_t) + \gamma \, \mathbb{E}_{S_{t+1}} \big[ \hat{v}(S_{t+1} \mid \mathbf{w}^{\text{target}}) \big] \tag{4}$$

by minimizing the soft Bellman residual

$$J_{\hat{q}}(\boldsymbol{\theta}_{q_i}) = \mathbb{E}_{(S_t, A_t) \sim \mathcal{D}} \bigg[ \frac{1}{2} \Big( \hat{q}_i(S_t, A_t \mid \boldsymbol{\theta}_{q_i}) - \hat{q}_{\text{target}}(S_t, A_t) \Big)^{2} \bigg] \tag{5}$$

The gradient of Equation 5 is:

$$\nabla_{\boldsymbol{\theta}_{q_i}} J_{\hat{q}}(\boldsymbol{\theta}_{q_i}) = \nabla_{\boldsymbol{\theta}_{q_i}} \hat{q}_i(S_t, A_t \mid \boldsymbol{\theta}_{q_i}) \Big( \hat{q}_i(S_t, A_t \mid \boldsymbol{\theta}_{q_i}) - r(S_t, A_t) - \gamma \, \hat{v}(S_{t+1} \mid \mathbf{w}^{\text{target}}) \Big) \tag{6}$$

where $\mathbf{w}^{\text{target}}$ are the parameters of the target state-value function[^4].
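A matching sketch for the soft Q update, with `v_target_net` standing in for the target state-value network; the names and tensor shapes are assumptions for the example, not the paper's code.

```python
import torch

def q_loss(q_net, v_target_net, states, actions, rewards, next_states, gamma=0.99):
    """Soft Bellman residual (Equation 5) for one of the two Q-networks."""
    with torch.no_grad():
        # Soft TD target (Equation 4), built from the *target* state-value network.
        td_target = rewards + gamma * v_target_net(next_states).squeeze(-1)
    return 0.5 * (q_net(states, actions).squeeze(-1) - td_target).pow(2).mean()
```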
Policy Improvement
In the policy improvement step, the policy is updated using the value functions from the previous step.

In this paper, the policy is updated by minimizing the KL divergence between the policy and a Boltzmann distribution whose energy is given by the soft action-value function:

$$J_{\pi}(\boldsymbol{\theta}_{\pi}) = \mathbb{E}_{S_t \sim \mathcal{D}} \bigg[ D_{\mathrm{KL}} \Big( \pi(\cdot \mid S_t, \boldsymbol{\theta}_{\pi}) \,\Big\|\, \frac{\exp \big( \hat{q}(S_t, \cdot) \big)}{Z(S_t)} \Big) \bigg] \tag{7}$$

where $Z(S_t)$ is the partition function that normalizes the Boltzmann distribution; it does not depend on $\boldsymbol{\theta}_{\pi}$ and can be ignored when taking gradients.
To take the gradient of Equation 7, the paper uses standard backpropagation. Since the KL divergence is an expectation over a distribution that depends on the parameters of the policy, we cannot backpropagate through the sampling step directly; we need to remove this dependence. To do so, we use the reparametrization trick, which moves the stochasticity of the distribution away from the learnable parameters and into an independent noise variable. This doesn't work for every distribution, but it does for Gaussians.
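As a toy sketch of the trick (not specific to SAC): writing the sample as mean plus scaled noise lets gradients flow into the distribution parameters, which sampling from the distribution directly would not allow.

```python
import torch

mu = torch.tensor([0.0], requires_grad=True)
log_sigma = torch.tensor([0.0], requires_grad=True)

# Reparametrized sample: the randomness lives in eps, not in the learnable parameters.
eps = torch.randn(1)
a = mu + log_sigma.exp() * eps     # equivalent to Normal(mu, sigma).rsample()

loss = (a - 1.0).pow(2).mean()
loss.backward()
print(mu.grad, log_sigma.grad)     # gradients flow through the sample
```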
The policy is parametrized using function approximation to estimate the mean and standard deviation of a Gaussian squashed by a $\tanh$:

$$A_t = f(\boldsymbol{\epsilon}_t; S_t) \stackrel{.}{=} \tanh \big( \mu(S_t \mid \boldsymbol{\theta}_{\pi}) + \sigma(S_t \mid \boldsymbol{\theta}_{\pi}) \odot \boldsymbol{\epsilon}_t \big), \qquad \boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, I)$$
With this reparametrization, Equation 7 becomes (dropping the partition function, which does not affect the gradient):

$$J_{\pi}(\boldsymbol{\theta}_{\pi}) = \mathbb{E}_{S_t \sim \mathcal{D},\, \boldsymbol{\epsilon}_t \sim \mathcal{N}} \Big[ \log \pi \big( f(\boldsymbol{\epsilon}_t; S_t) \mid S_t, \boldsymbol{\theta}_{\pi} \big) - \hat{q} \big( S_t, f(\boldsymbol{\epsilon}_t; S_t) \big) \Big] \tag{8}$$

The gradient of Equation 8 is:

$$\nabla_{\boldsymbol{\theta}_{\pi}} J_{\pi}(\boldsymbol{\theta}_{\pi}) = \nabla_{\boldsymbol{\theta}_{\pi}} \log \pi(A_t \mid S_t, \boldsymbol{\theta}_{\pi}) + \Big( \nabla_{A_t} \log \pi(A_t \mid S_t, \boldsymbol{\theta}_{\pi}) - \nabla_{A_t} \hat{q}(S_t, A_t) \Big) \nabla_{\boldsymbol{\theta}_{\pi}} f(\boldsymbol{\epsilon}_t; S_t) \tag{9}$$

where $A_t = f(\boldsymbol{\epsilon}_t; S_t)$.
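Putting the pieces together, here is a sketch of the reparametrized policy objective for a tanh-squashed Gaussian, assuming a hypothetical `policy_net` that outputs the Gaussian mean and log standard deviation; the tanh log-probability correction follows the paper's appendix, while the module names and the `1e-6` stabilizer are my own choices for the example.

```python
import torch
from torch.distributions import Normal

def policy_loss(policy_net, q1_net, q2_net, states):
    """Reparametrized policy objective (Equation 8): E[log pi(a|s) - q_min(s, a)]."""
    mean, log_std = policy_net(states)       # assumed to return the Gaussian parameters
    dist = Normal(mean, log_std.exp())
    pre_tanh = dist.rsample()                # reparametrized sample: mean + std * eps
    actions = torch.tanh(pre_tanh)
    # log pi(a|s) with the change-of-variables correction for the tanh squashing.
    log_pi = dist.log_prob(pre_tanh).sum(dim=-1)
    log_pi = log_pi - torch.log(1.0 - actions.pow(2) + 1e-6).sum(dim=-1)
    q_min = torch.min(q1_net(states, actions), q2_net(states, actions)).squeeze(-1)
    return (log_pi - q_min).mean()
```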
Pseudocode
Below is the pseudocode algorithm specification of SAC using a Py-like syntax.
Soft Actor-Critic for estimating $\pi \approx \pi_{*}$
## Inputs
# Input: differentiable state-value function approximation v_hat(s | w)
v_hat = v_hat.init(w)
# Input: target state-value function approximation v_hat_bar(s | w_target)
v_hat_bar = v_hat.copy()
# Input: differentiable soft state-action function approximation q_hat_1(s, a | theta_q_1)
q_hat_1 = q_hat_1.init(theta_q_1)
# Input: differentiable soft state-action function approximation q_hat_2(s, a | theta_q_2)
q_hat_2 = q_hat_2.init(theta_q_2)
# Input: differentiable policy parameterization pi(a | s, theta_pi)
pi = pi.init(theta_pi)
## Parameters
# where appropriate the values are from the paper
# Stepsizes
alpha_v_hat, alpha_q_hat, alpha_pi = 3e-4, 3e-4, 3e-4
# Entropy coefficient (temperature)
beta = 0.01
# Target smoothing coefficient (Polyak averaging)
tau = 0.005
# Discount factor
gamma = 0.99
## Initialize
B = ReplayBuffer(size=1000_000)
s, done = env.reset(), False
## Define
# double q-function minimum state-action function approximation q_hat(s, a)
q_hat = lambda s, a: min(q_hat_1(s, a), q_hat_2(s, a))
while not done:
    a = pi(s)  # sample an action from the current stochastic policy
    s_next, r, done = env.step(a)
    B.append((s, a, r, s_next))
    batch = B.sample(size=100)
    for (s_b, a_b, r_b, s_next_b) in batch:
        # Fresh action from the current policy via the reparametrization trick
        a_pi = pi(s_b)
        log_pi = log(pi(a_pi, s_b, theta_pi))  # log pi(a_pi | s_b)
        # Update state-value function (cf. Equations 2 and 3; beta is the entropy coefficient)
        J_v_hat = 0.5 * ((v_hat(s_b) - (q_hat(s_b, a_pi) - beta * log_pi))**2)
        w = w - alpha_v_hat * grad(J_v_hat, w)
        # Update target state-value function (Polyak averaging)
        w_target = tau * w + (1 - tau) * w_target
        # Update state-action value functions (cf. Equations 5 and 6)
        for i in range(2):
            J_q = 0.5 * ((q_hat_i(s_b, a_b, theta_q_i) -
                          (r_b + gamma * v_hat_bar(s_next_b, w_target)))**2
                         )
            theta_q_i = theta_q_i - alpha_q_hat * grad(J_q, theta_q_i)
        # Update policy parameters (cf. Equations 8 and 9)
        J_theta_pi = beta * log_pi - q_hat(s_b, a_pi)
        theta_pi = theta_pi - alpha_pi * grad(J_theta_pi, theta_pi)
    s = s_next
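For completeness, here is a PyTorch-style version of the Polyak target update used in the loop above, following the paper's convention in which $\tau = 0.005$ weighs the new parameters; the module names are placeholders.

```python
import torch

@torch.no_grad()
def polyak_update(net, target_net, tau=0.005):
    """target <- tau * net + (1 - tau) * target, parameter by parameter."""
    for p, p_target in zip(net.parameters(), target_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p)
```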
Thoughts
- I’m skeptical about the experimental setup, as some of the results don’t agree with the TD3 paper (Fujimoto et al., 2018), which was developed concurrently with this work.
- To my understanding, SAC as developed above is not a true policy gradient method since the policy is not updated using the policy gradient theorem.
- It would be interesting to know how SAC performs if the policy is optimized using the likelihood ratio from the policy gradient theorem instead of backpropagating through the action-value function.
References
[^1]: https://spinningup.openai.com/en/latest/algorithms/sac.html#id6
[^2]: Also known as the temperature parameter in the literature.
[^3]: In the limit $\alpha \rightarrow 0$, the standard RL objective is recovered.
[^4]: This target is used to make the updates of the value function stable. It can be updated using an exponentially moving average (cf. Polyak averaging) of the actual state-value function parameters $\mathbf{w}$, or updated periodically.