Paper reading | CROP

CROP: CERTIFYING ROBUST POLICIES FOR REINFORCEMENT LEARNING THROUGH FUNCTIONAL SMOOTHING by Fan Wu et al.

Introduction

  • This paper mainly asks two questions:
  • How to provide efficient and effective robustness certification for RL algorithms?
  • What criteria should be used to certify the robustness of RL algorithms?
  • Focus on Q-learning with two certification criteria
  • Per-state action stability
  • Lower bound of perturbed cumulative reward

  • Contributions of this paper

  • Propose a framework for certifying the robustness of Q-learning algorithms

    • Two robustness certification criteria
  • Prove the certification radius for input states and a lower bound on the perturbed cumulative reward under bounded adversarial state perturbations

  • Conduct extensive experiments to provide certification for nine empirically robust RL algorithms on multiple RL environments

Preliminaries

  • Discounted discrete-time MDPs
  • Defined by \((\mathcal{S},\mathcal{A},R,P,\gamma,d_0)\), which denote the state space, the discrete action space, the reward function, the transition function, the discount factor, and the initial state distribution, respectively (a minimal container sketch follows).
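
As a quick illustration of this tuple, here is a minimal sketch of how the components could be held in code; the field names and callable signatures are assumptions made for this sketch, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

import numpy as np

# Illustrative container for the MDP tuple (S, A, R, P, gamma, d0).
# The callable signatures are assumptions made for this sketch.
@dataclass
class MDP:
    states: Sequence[np.ndarray]                          # S: (a finite sample of) the state space
    actions: Sequence[int]                                # A: discrete action set
    reward: Callable[[np.ndarray, int], float]            # R(s, a)
    transition: Callable[[np.ndarray, int], np.ndarray]   # sample s' ~ P(. | s, a)
    gamma: float                                          # discount factor
    d0: Callable[[], np.ndarray]                          # sample an initial state s0 ~ d0
```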

Robustness certification in Q-learning

  • Consider the standard adversarial setting in Q-learning, where the adversary can apply an \(l_2\)-bounded perturbation \(\delta\in\mathcal B^{\varepsilon}\) to the agent's input state observations at decision time.

Robustness certification for Q-learning with different criteria

  • Def. 1. Given a trained network \(Q^\pi\), define the robustness certification for per-state action as the maximum perturbation magnitude \(\bar \varepsilon\), s.t. for any perturbation \(\delta\in\mathcal B^{\bar\varepsilon}\), the predicted action under the perturbed state will be the same as the action taken in the clean environment, i.e., \(\pi(s+\delta)=\pi(s),\forall \delta\in\mathcal B^{\bar\varepsilon}\).

  • Def. 2. Define the perturbed cumulative reward as follows \[ J_{\varepsilon}(\pi)\triangleq \sum_{t=0}^\infty\gamma^t R(s_t,\pi(s_t+\delta_t)),\text{ where }s_{t+1}\sim P(s_t,\pi(s_t+\delta_t)),\ \delta_t\in\mathcal B^{\varepsilon},\ s_0\sim d_0 \]

  • Def. 3. The robustness certification of the cumulative reward is a lower bound \(\underline J\) on the perturbed cumulative reward, i.e., \(\underline J\le J_{\varepsilon}(\pi)\) for any perturbations in \(\mathcal B^\varepsilon\) applied at all time steps (a rollout sketch of \(J_\varepsilon(\pi)\) follows below).
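
To make Definitions 2 and 3 concrete, here is a hedged sketch of one perturbed rollout; `env`, `policy`, and `attacker` are hypothetical interfaces introduced for illustration, and a single rollout only yields one sample of \(J_\varepsilon(\pi)\), not the certified lower bound \(\underline J\).

```python
def perturbed_return(env, policy, attacker, gamma, horizon, eps):
    """One rollout of the perturbed cumulative reward (Def. 2).

    Hypothetical interfaces (not from the paper):
      env.reset() -> s0          env.step(a) -> (next_state, reward, done)
      policy(s) -> action        attacker(s, eps) -> delta with ||delta||_2 <= eps
    """
    s = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        delta = attacker(s, eps)      # adversarial perturbation inside the l2 ball B^eps
        a = policy(s + delta)         # the agent acts on the perturbed observation
        s, r, done = env.step(a)      # reward and dynamics depend on the true state
        ret += discount * r
        discount *= gamma
        if done:
            break
    return ret  # a sample of J_eps(pi); Def. 3 asks for a bound below it over all attackers
```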

Robustness certification strategies for per-state action

Action-value functional smoothing

  • Given \(Q^\pi\) with policy \(\pi\), derive a smoothed function \(\tilde Q^\pi\) through per-state local smoothing.

  • At each time step \(t\), for each action \(a\in \mathcal A\), we draw random noise from a Gaussian distribution \(\mathcal N(0,\sigma^2I_N)\) and take the expectation over the noise: \(\tilde Q^\pi(s_t,a)=\mathbb E_{\Delta\sim\mathcal N(0,\sigma^2 I_N)}\left[Q^\pi(s_t+\Delta,a)\right]\) (a Monte Carlo sketch follows below).

  • Then, based on the smoothness of \(\tilde Q^\pi\), we have the following theorems.
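
A minimal Monte Carlo sketch of this per-state smoothing step; the `q_net` interface, sample count, and seed handling are assumptions for illustration, and in practice the estimate has to be paired with confidence bounds (next subsection).

```python
import numpy as np

def smoothed_q_values(q_net, s, sigma, n_samples=1000, seed=0):
    """Estimate Q~(s, .) = E_{Delta ~ N(0, sigma^2 I_N)}[Q(s + Delta, .)] by Monte Carlo.

    `q_net(state) -> np.ndarray of shape (|A|,)` is a hypothetical Q-network interface.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_samples,) + np.shape(s))   # Delta ~ N(0, sigma^2 I_N)
    q_samples = np.stack([q_net(s + d) for d in noise])               # one row of Q-values per draw
    return q_samples.mean(axis=0)                                      # empirical smoothed values

# The smoothed policy then acts greedily on the smoothed values, e.g.:
# action = int(np.argmax(smoothed_q_values(q_net, s, sigma=0.1)))
```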

Local randomized smoothing

  • Challenges of certifying Q-learning compared with the standard classification task:

  • Unknown range \([V_{\min},V_{\max}]\)

    • Do pre-processing: estimate the output range based on a finite set of valid states
  • For Q-networks, the outputs are not probabilities, so the multinomial-proportion estimates used for smoothed classifiers do not directly apply.

    • Use Hoeffding's inequality to obtain high-confidence bounds on the smoothed Q-values from finitely many noise samples (see the sketch after this list)
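
The sketch below illustrates how Hoeffding's inequality can turn finitely many noise samples (assumed to lie in the estimated \([V_{\min},V_{\max}]\)) into confidence bounds, and how such bounds can feed a Gaussian-smoothing radius for the top two actions. The radius expression follows the standard smoothing argument for functions rescaled to \([0,1]\); treat the whole block as an illustrative sketch rather than the paper's exact theorem or constants.

```python
import numpy as np
from scipy.stats import norm

def hoeffding_interval(samples, v_min, v_max, alpha=0.05):
    """Two-sided Hoeffding bound for the mean of i.i.d. samples that lie in [v_min, v_max]."""
    n = len(samples)
    half_width = (v_max - v_min) * np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    mean = float(np.mean(samples))
    return mean - half_width, mean + half_width

def per_state_radius(q_samples, v_min, v_max, sigma, alpha=0.05):
    """Sketch of a per-state certified radius from sampled smoothed Q-values.

    q_samples: array of shape (n_noise, |A|) with entries assumed to lie in [v_min, v_max].
    Uses a lower confidence bound for the top action, an upper bound for the runner-up,
    then a Gaussian-smoothing radius on the values rescaled to [0, 1].
    """
    means = q_samples.mean(axis=0)
    order = np.argsort(means)
    a1, a2 = order[-1], order[-2]                                    # top action and runner-up
    lo1, _ = hoeffding_interval(q_samples[:, a1], v_min, v_max, alpha)
    _, hi2 = hoeffding_interval(q_samples[:, a2], v_min, v_max, alpha)
    p1 = np.clip((lo1 - v_min) / (v_max - v_min), 1e-6, 1 - 1e-6)    # rescale to [0, 1]
    p2 = np.clip((hi2 - v_min) / (v_max - v_min), 1e-6, 1 - 1e-6)
    if p1 <= p2:
        return 0.0                                                   # cannot certify this state
    return 0.5 * sigma * (norm.ppf(p1) - norm.ppf(p2))
```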

Robustness certification strategies for the cumulative reward

Global smoothing

  • Perform global smoothing over the state trajectory by viewing the cumulative reward of the entire trajectory as one function of the perturbations added at all time steps.

  • Def. 4. Given a state trajectory \((s_0,s_1,\cdots,s_{H-1})\), we derive a \(\sigma\)-randomized trajectory as \((s^\prime_0,s_1^\prime,\cdots,s_{H-1}^\prime)\) where \(s_{t+1}^\prime\sim P(s_t^\prime,\pi(s^\prime_t+\Delta_t)),\Delta_t\sim\mathcal N(0,\sigma^2I_N)\), and \(s_0^\prime=s_0\sim d_0\).

  • Define \(\oplus\) as the concatenation operator over per-step inputs, i.e., over the input states or the noise vectors added to each state.

  • Def. 5. Define a bounded perturbed return function \(F_\pi:\mathbb R^{H\times N}\rightarrow [J_{\min},J_{\max}]\) representing the cumulative reward under potential perturbations \(\delta\): \[ F_\pi(\oplus_{t=0}^{H-1}\delta_t)\triangleq\sum_{t=0}^{H-1}\gamma^tR(s_t,\pi(s_t+\delta_t)),\text{ where }s_{t+1}\sim P(s_t,\pi(s_t+\delta_t)),s_0\sim d_0 \]

  • Then we can also derive two theorems based on the above definitions.

  • Percentile bound: a lower bound derived from percentiles of the randomized returns rather than their expectation (see the sampling sketch below).
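
A hedged sketch of the sampling side of global smoothing: roll out \(\sigma\)-randomized trajectories (Def. 4) and collect the empirical mean and a chosen percentile of the randomized returns, which are the quantities the expectation and percentile bounds reason about. The `env` and `policy` interfaces are hypothetical, and these raw statistics are not themselves the certified bounds.

```python
import numpy as np

def randomized_return(env, policy, gamma, sigma, horizon, rng):
    """One sigma-randomized trajectory (Def. 4): Gaussian noise is added to every observation."""
    s = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        noise = rng.normal(0.0, sigma, size=np.shape(s))   # Delta_t ~ N(0, sigma^2 I_N)
        a = policy(s + noise)
        s, r, done = env.step(a)
        ret += discount * r
        discount *= gamma
        if done:
            break
    return ret

def smoothed_return_statistics(env, policy, gamma, sigma, horizon, n_traj=1000, pct=50, seed=0):
    """Empirical mean and pct-th percentile of the randomized returns."""
    rng = np.random.default_rng(seed)
    returns = np.array([randomized_return(env, policy, gamma, sigma, horizon, rng)
                        for _ in range(n_traj)])
    return returns.mean(), np.percentile(returns, pct)
```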

Local smoothing

  • Theorem 4: an absolute lower bound on the cumulative reward, obtained by applying per-state local smoothing along the trajectory and searching over the trajectories that perturbations in \(\mathcal B^\varepsilon\) could induce.


Author: Maxwell Yao