Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies

University of Maryland, College Park · JP Morgan AI Research
ICLR 2024 spotlight

*Equal Contribution

Framework of PROTECTED: Pre-training Non-dominated Policies towards Online Adaptation

Abstract

In light of the burgeoning success of reinforcement learning (RL) in diverse real-world applications, considerable focus has been directed towards ensuring RL policies are robust to adversarial attacks during test time. Current approaches largely revolve around solving a minimax problem to prepare for potential worst-case scenarios. While effective against strong attacks, these methods often compromise performance in the absence of attacks or in the presence of only weak attacks. To address this, we study policy robustness under the well-accepted state-adversarial attack model, extending our focus beyond merely worst-case attacks. We first formalize this task at test time as a regret minimization problem and establish its intrinsic difficulty: sublinear regret is unattainable when the baseline policy comes from a general continuous policy class $\Pi$. This finding prompts us to refine the baseline policy class prior to test time, aiming for efficient adaptation within a compact, finite policy class $\widetilde{\Pi}$, for which adaptation can be carried out by an adversarial bandit subroutine. In light of the importance of a finite and compact $\widetilde{\Pi}$, we propose a novel training-time algorithm that iteratively discovers non-dominated policies, forming a near-optimal and minimal $\widetilde{\Pi}$, thereby ensuring both robustness and test-time efficiency. Empirical validation on Mujoco corroborates the superiority of our approach in terms of natural and robust performance, as well as its adaptability to various attack scenarios.

Motivation



"Is it possible to develop a comprehensive framework that enhances the performance of the victim against non-worst-case attacks, while maintaining robustness against worst-case scenarios?"

Test-time Online Adaptation

Instead of employing a static victim policy, we propose adaptively selecting policies based on online reward feedback during test time. Although sublinear regret is not guaranteed in the broader policy class $\Pi$, it is achievable in a smaller, finite but refined policy class $\widetilde{\Pi}$.
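Concretely (with notation assumed here rather than quoted from the paper), write $J(\pi, \nu_t)$ for the victim's expected return when playing policy $\pi$ against the attacker $\nu_t$ active in episode $t$; the test-time objective is then to minimize

$$\mathrm{Regret}(T) \;=\; \max_{\pi \in \widetilde{\Pi}} \sum_{t=1}^{T} J(\pi, \nu_t) \;-\; \sum_{t=1}^{T} J(\pi_t, \nu_t),$$

where $\pi_t$ is the policy selected in episode $t$. Sublinear regret means the victim's average reward approaches that of the best fixed policy in $\widetilde{\Pi}$ in hindsight, whatever attack sequence is played.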

Given such a refined policy class, we can perform online adaptation with an adversarial bandit algorithm such as EXP3, which maintains a meta-policy (a distribution over $\widetilde{\Pi}$) during online adaptation and adjusts the weight of each policy based on the online reward feedback; a minimal sketch follows the algorithm figure below.

[Algorithm 1: test-time online adaptation of the meta-policy via EXP3]
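As an illustration, here is a minimal EXP3-style loop in the spirit of the adaptation algorithm above. The run_episode interface is a hypothetical stand-in for rolling out one test-time episode under the (unknown) attacker and returning a reward normalized to $[0, 1]$; it is not part of the paper's code.

import numpy as np

def exp3_adaptation(policies, run_episode, T, gamma=0.1):
    """Select among pre-trained policies from bandit reward feedback (EXP3).

    policies    -- the refined class: a list of K pre-trained victim policies
    run_episode -- callable(policy) -> reward in [0, 1] under the unknown attacker
    T           -- number of test-time episodes
    gamma       -- exploration rate, also used to set the learning rate
    """
    K = len(policies)
    log_w = np.zeros(K)  # log-space weights of the meta-policy, for numerical stability
    for t in range(T):
        # Meta-policy: exponential weights mixed with uniform exploration.
        w = np.exp(log_w - log_w.max())
        p = (1.0 - gamma) * w / w.sum() + gamma / K
        k = np.random.choice(K, p=p)
        r = run_episode(policies[k])        # bandit feedback for the chosen policy only
        log_w[k] += (gamma / K) * r / p[k]  # importance-weighted, unbiased update
    return p  # final meta-policy over the refined class

Only the reward of the policy actually played is observed, hence the importance-weighted estimate $r/p_k$; this is what lets the meta-policy concentrate on the best member of $\widetilde{\Pi}$ without ever querying the attacker directly.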

Iterative Discovery in Pre-training

While there is a trade-off between optimality (the gap between the maximum rewards attainable in the original policy class $\Pi$ and in the refined policy class $\widetilde{\Pi}$) and efficiency (the cardinality $|\widetilde{\Pi}|$ of the refined policy class), it is proved that a zero gap is achievable with a finite refined policy class. However, this finite policy class can still be relatively large. To effectively reduce the cardinality of $\widetilde{\Pi}$, we propose an iterative pre-training approach to construct $\widetilde{\Pi}$.
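To make the trade-off concrete (again using the assumed notation $J(\pi, \nu)$; this is a paraphrase rather than the paper's exact statement), the optimality side can be measured by the worst-case gap

$$\mathrm{gap}(\widetilde{\Pi}) \;=\; \max_{\nu}\Big(\max_{\pi \in \Pi} J(\pi, \nu) \;-\; \max_{\pi \in \widetilde{\Pi}} J(\pi, \nu)\Big),$$

i.e., how much reward is lost against the hardest attacker by restricting the victim to $\widetilde{\Pi}$, while the efficiency side is simply $|\widetilde{\Pi}|$.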

In each iteration, a new non-dominated policy (the red dot in the figure), whose reward against at least one attacker exceeds that of every meta-policy over the current $\widetilde{\Pi}$, is added to $\widetilde{\Pi}$. This process effectively prunes redundant dominated policies (the orange area) while preserving the optimality of $\widetilde{\Pi}$; the dominance check behind this step is sketched after the algorithm figure below.

[Algorithm 2: iterative discovery of non-dominated policies during pre-training]
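The key primitive in this loop is the dominance check. Assuming rewards have been estimated against a finite set of sampled attackers (a simplification; the general setting is continuous), a candidate policy is dominated exactly when some mixture of the current policies matches or beats it against every attacker, which is a small linear program. The sketch below, including the function name, is illustrative rather than the paper's implementation.

import numpy as np
from scipy.optimize import linprog

def is_dominated(R, r_new, tol=1e-9):
    """Is the candidate dominated by a mixture of the current policies?

    R     -- (n, m) array: R[i, j] = reward of current policy i vs. attacker j
    r_new -- (m,) array: rewards of the candidate policy vs. the same attackers

    Solves  max_{w, t} t  s.t.  (w @ R)[j] - r_new[j] >= t for all j,
    sum(w) = 1, w >= 0, and reports dominance when the optimum t >= 0.
    """
    n, m = R.shape
    c = np.zeros(n + 1)
    c[-1] = -1.0                               # linprog minimizes, so maximize t via -t
    A_ub = np.hstack([-R.T, np.ones((m, 1))])  # -(R^T w) + t <= -r_new
    b_ub = -np.asarray(r_new, dtype=float)
    A_eq = np.ones((1, n + 1))
    A_eq[0, -1] = 0.0                          # mixture weights sum to one; t is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * n + [(None, None)], method="highs")
    return res.status == 0 and -res.fun >= -tol  # optimal t >= 0 (up to tolerance)

For example, with two policies scoring (1, 0) and (0, 1) against two attackers, a candidate scoring (0.6, 0.6) beats neither policy on any single attacker yet is correctly reported as non-dominated, since no mixture attains 0.6 against both; in the pre-training loop, each new non-dominated policy found this way is added to $\widetilde{\Pi}$.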

Experimental Results

Our methods yield considerably higher natural rewards and consistently enhanced robustness against a spectrum of attacks in various Mujoco environments.


Moreover, in scenarios where the attacker exhibits dynamic (time-varying) behavior, EXP3 rapidly and reliably identifies the best policy within the non-dominated policy class $\widetilde{\Pi}$.

BibTeX

@inproceedings{liu2024beyond,
  title={Beyond Worst-case Attacks: Robust {RL} with Adaptive Defense via Non-dominated Policies},
  author={Xiangyu Liu and Chenghao Deng and Yanchao Sun and Yongyuan Liang and Furong Huang},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=DFTHW0MyiW}
}