Paper reading | CROP
CROP: CERTIFYING ROBUST POLICIES FOR REINFORCEMENT LEARNING THROUGH FUNCTIONAL SMOOTHING
by Fan Wu et al.
Introduction
- This paper mainly asks two questions:
- How to provide efficient and effective robustness certification for RL algorithms?
- What criteria should be used to certify the robustness of RL algorithms?
- Focus on Q-learning with two certification criteria
- Per-state action stability
- Lower bound of the perturbed cumulative reward
Contributions of this paper
- Propose a framework for certifying the robustness of Q-learning algorithms
- Two robustness certification criteria
- Prove the certification radius for the input state and the lower bound of the perturbed cumulative reward under bounded adversarial state perturbations
- Conduct extensive experiments to provide certification for nine empirically robust RL algorithms on multiple RL environments
Preliminaries
- Discounted discrete-time MDPs
- Defined by \((\mathcal S,\mathcal A,R,P,\gamma,d_0)\), which denote the state space, discrete action space, reward function, transition function, discount factor, and initial state distribution, respectively.
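As a concrete reference, below is a minimal Python sketch of this tuple as a container; the class and field names, and the sampled-transition interface, are illustrative assumptions rather than anything from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

import numpy as np


@dataclass
class DiscountedMDP:
    """Container mirroring the tuple (S, A, R, P, gamma, d0); names are illustrative."""
    states: Sequence[np.ndarray]                          # state space S (here: a finite sample of states)
    actions: Sequence[int]                                # discrete action space A
    reward: Callable[[np.ndarray, int], float]            # R(s, a)
    transition: Callable[[np.ndarray, int], np.ndarray]   # samples s' ~ P(s, a)
    gamma: float                                          # discount factor in [0, 1)
    initial_state_dist: Callable[[], np.ndarray]          # samples s_0 ~ d_0
```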
Robustness certification in Q-learning
- Consider the standard adversarial setting in Q-learning, where the adversary can apply \(l_2\)-bounded perturbations \(\delta\in\mathcal B^{\varepsilon}\) to the agent's input state observations at decision time.
Robustness certification for Q-learning with different criteria
Def. 1. Given a trained network \(Q^\pi\), define the robustness certification for per-state action as the maximum perturbation magnitude \(\bar \varepsilon\) s.t. for any perturbation \(\delta\in\mathcal B^{\bar\varepsilon}\), the action predicted under the perturbed state is the same as the action taken in the clean environment, i.e., \(\pi(s+\delta)=\pi(s),\forall \delta\in\mathcal B^{\bar\varepsilon}\).
Def. 2. Define the perturbed cumulative reward as follows: \[ J_{\varepsilon}(\pi)\triangleq \sum_{t=0}^\infty\gamma^t R(s_t,\pi(s_t+\delta_t)),\text{ where }s_{t+1}\sim P(s_t,\pi(s_t+\delta_t)),\ s_0\sim d_0 \]
Def. 3. The robustness certification of the cumulative reward is a lower bound \(\underline J\) of the perturbed cumulative reward, s.t. \(\underline J\le J_{\varepsilon}(\pi)\) for any perturbations in \(\mathcal B^\varepsilon\) applied at all time steps.
Robustness certification strategies for per-state action
Action-value functional smoothing
Given \(Q^\pi\) with policy \(\pi\), derive a smoothed function \(\tilde Q^\pi\) through per-state local smoothing.
At each time step \(t\), for each action \(a\in \mathcal A\), we draw random noise from a Gaussian distribution \(\mathcal N(0,\sigma^2I_N)\) and average the action values over the noisy copies of the state: \[ \tilde Q^\pi(s,a)\triangleq\mathbb E_{\Delta\sim\mathcal N(0,\sigma^2I_N)}\left[Q^\pi(s+\Delta,a)\right] \]
- Then, based on this smoothed function, the theorems certifying the per-state action (the certified radius) follow; a Monte Carlo sketch of the smoothing itself is given after this list.
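A minimal Monte Carlo sketch of this per-state local smoothing. The function name `q_network`, its interface (state vector in, per-action Q-values out), and the sample count `m` are assumptions for illustration, not the paper's code.

```python
import numpy as np


def smoothed_q_values(q_network, state, sigma, m=1000, rng=None):
    """Monte Carlo estimate of the smoothed Q-function
    Q~(s, a) = E_{Delta ~ N(0, sigma^2 I_N)} [ Q(s + Delta, a) ].

    q_network: callable mapping a state vector to a vector of per-action Q-values.
    state:     clean state observation (1-D numpy array of dimension N).
    sigma:     smoothing noise standard deviation.
    m:         number of noise samples (assumed hyperparameter).
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=(m, state.shape[0]))             # m draws from N(0, sigma^2 I_N)
    q_samples = np.stack([q_network(state + delta) for delta in noise])  # shape (m, |A|)
    return q_samples.mean(axis=0)                                        # one smoothed value per action


# The smoothed policy then picks the action with the largest smoothed Q-value:
# action = int(np.argmax(smoothed_q_values(q_network, s, sigma)))
```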
Local randomized smoothing
Challenges for Q-learning compared with the standard classification task:
The output range \([V_{\min},V_{\max}]\) is unknown
- Pre-processing: estimate the output range from a finite set of valid states
For Q-networks, the outputs are not probabilities, so estimating the multinomial proportions used in classification-style smoothing is challenging
- Apply Hoeffding's inequality to bound the Monte Carlo estimates of the smoothed Q-values (a sketch follows below)
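A sketch of how Hoeffding's inequality can bound the Monte Carlo estimates, assuming the Q-network outputs lie in the estimated range \([V_{\min},V_{\max}]\); the failure probability `alpha`, the per-action union bound, and all names are assumptions for illustration.

```python
import numpy as np


def hoeffding_q_bounds(q_samples, v_min, v_max, alpha=0.05):
    """High-confidence bounds on smoothed Q-values via Hoeffding's inequality.

    q_samples: array of shape (m, |A|): Q-values at m noisy copies of the state,
               assumed to lie in [v_min, v_max].
    alpha:     total failure probability (assumed; split evenly over the |A| actions).

    For the mean of m i.i.d. variables bounded in [v_min, v_max], Hoeffding gives
    |mean - expectation| <= (v_max - v_min) * sqrt(log(2 / alpha') / (2 m))
    with probability at least 1 - alpha'.
    """
    m, num_actions = q_samples.shape
    alpha_per_action = alpha / num_actions                                       # union bound over actions
    half_width = (v_max - v_min) * np.sqrt(np.log(2.0 / alpha_per_action) / (2.0 * m))
    mean = q_samples.mean(axis=0)
    lower = np.clip(mean - half_width, v_min, v_max)
    upper = np.clip(mean + half_width, v_min, v_max)
    return lower, upper                                                          # per-action confidence bounds
```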
Robustness certification strategies for the cumulative reward
Global smoothing
Perform global smoothing over the state trajectory by viewing the cumulative reward of the entire trajectory as a single function of the perturbations added at all time steps.
Def. 4. Given a state trajectory \((s_0,s_1,\cdots,s_{H-1})\), we derive a \(\sigma\)-randomized trajectory as \((s^\prime_0,s_1^\prime,\cdots,s_{H-1}^\prime)\) where \(s_{t+1}^\prime\sim P(s_t^\prime,\pi(s^\prime_t+\Delta_t)),\Delta_t\sim\mathcal N(0,\sigma^2I_N)\), and \(s_0^\prime=s_0\sim d_0\).
Define \(\oplus\) as the concatenation of per-step inputs (states, or the noise added to each state).
Def. 5. Define a bounded perturbed return function \(F_\pi:\mathbb R^{H\times N}\rightarrow [J_{\min},J_{\max}]\) representing the cumulative reward under potential perturbations \(\delta_t\): \[ F_\pi(\oplus_{t=0}^{H-1}\delta_t)\triangleq\sum_{t=0}^{H-1}\gamma^tR(s_t,\pi(s_t+\delta_t)),\text{ where }s_{t+1}\sim P(s_t,\pi(s_t+\delta_t)),\ s_0\sim d_0 \]
Then we can also derive two theorems based on the above definitions; a rollout sketch of the \(\sigma\)-randomized return is given below.
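A rollout sketch matching Defs. 4-5, assuming a gym-style environment (`reset() -> state`, `step(a) -> (state, reward, done, info)`) and a policy `pi(state) -> action`; this interface and all names are assumptions for illustration.

```python
import numpy as np


def sigma_randomized_return(env, pi, sigma, gamma, horizon, rng=None):
    """Sample one sigma-randomized trajectory (Def. 4) and return its discounted reward,
    i.e. one draw of F_pi evaluated at Gaussian noise Delta_0 (+) ... (+) Delta_{H-1} (Def. 5).

    env:   gym-style environment (assumed interface, see lead-in above).
    pi:    policy mapping an observed state to a discrete action.
    sigma: standard deviation of the Gaussian noise added to each observed state.
    """
    rng = np.random.default_rng() if rng is None else rng
    state = env.reset()                                       # s_0 ~ d_0
    total, discount = 0.0, 1.0
    for _ in range(horizon):                                  # t = 0, ..., H-1
        noise = rng.normal(0.0, sigma, size=np.shape(state))  # Delta_t ~ N(0, sigma^2 I_N)
        action = pi(state + noise)                            # act on the noisy observation
        state, reward, done, _ = env.step(action)             # s_{t+1} ~ P(s_t, a_t)
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total
```

Roughly speaking, averaging many such draws (or taking an empirical percentile of them, cf. the percentile bound below) gives the Monte Carlo quantity on which the cumulative-reward certificates are built.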
Percentile bound.
Local smoothing
- Theorem 4