Continual learning is a long-standing challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work has extensively studied continual learning in relatively small behavior cloning (BC) policy models trained from scratch, its behavior in modern large-scale pretrained Vision-Language-Action (VLA) models remains underexplored. In this work, we find that pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. Simple Experience Replay (ER) works surprisingly well on VLAs, sometimes achieving zero forgetting even with a small replay buffer. Our analysis reveals that pretraining plays a critical role in downstream continual learning: large pretrained models mitigate forgetting with a small replay buffer while maintaining strong forward-learning capability. Furthermore, we find that VLAs retain relevant knowledge of prior tasks even as their performance degrades while learning new tasks, and this retained knowledge enables rapid recovery of seemingly forgotten skills through finetuning. Together, these insights suggest that large-scale pretraining fundamentally changes the dynamics of continual learning, enabling models to continually acquire new skills over time with simple replay.
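Experience Replay in this setting amounts to keeping a small buffer of prior-task demonstrations and mixing them into each new-task training batch. The following is a minimal sketch for exposition only, not the exact training recipe used in the experiments: the `ReplayBuffer` class, `er_batch` helper, reservoir-sampling policy, and `replay_ratio` parameter are all assumptions; in practice each sample would be a demonstration segment fed to the policy's imitation loss.

```python
import random


class ReplayBuffer:
    """Fixed-size buffer of prior-task samples, filled by reservoir sampling
    so every sample seen so far has equal probability of being retained."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            # Replace a random slot with probability capacity / n_seen.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = sample

    def sample(self, k):
        k = min(k, len(self.data))
        return self.rng.sample(self.data, k)


def er_batch(new_task_batch, buffer, replay_ratio=0.25):
    """Mix replayed prior-task samples into a batch of new-task samples."""
    n_replay = int(len(new_task_batch) * replay_ratio)
    return new_task_batch + buffer.sample(n_replay)
```

The key design choice ER makes is that the buffer is tiny relative to the full prior-task datasets; the results below show this suffices for pretrained VLAs.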
Here SR denotes success rate (higher is better) and NBT denotes negative backward transfer (lower is better; positive NBT indicates forgetting):

| Method | SR ↑ | NBT ↓ |
|---|---|---|
| Pi0 | 0.768±.017 | -0.016±.022 |
| GR00T N1.5 | 0.919±.011 | 0.027±.021 |
| BC-Diffusion | 0.696±.068 | 0.127±.071 |
| BC-Transformer | 0.585±.066 | 0.245±.080 |
| BC-ViT | 0.508±.142 | 0.193±.082 |
| Method | LIBERO-Spatial SR ↑ | LIBERO-Spatial NBT ↓ | LIBERO-Object SR ↑ | LIBERO-Object NBT ↓ | LIBERO-Goal SR ↑ | LIBERO-Goal NBT ↓ | LIBERO-10 SR ↑ | LIBERO-10 NBT ↓ |
|---|---|---|---|---|---|---|---|---|
| Pi0 | 0.879±.008 | 0.019±.019 | 0.897±.011 | -0.011±.034 | 0.732±.015 | -0.005±.010 | 0.563±.028 | -0.068±.018 |
| GR00T N1.5 | 0.940±.013 | 0.007±.009 | 0.975±.004 | 0.019±.013 | 0.943±.004 | 0.023±.017 | 0.820±.017 | 0.059±.035 |
| BC-Diffusion | 0.663±.074 | 0.050±.070 | 0.847±.069 | 0.025±.056 | 0.809±.089 | 0.230±.101 | 0.464±.024 | 0.201±.044 |
| BC-Transformer | 0.659±.058 | 0.299±.096 | 0.595±.112 | 0.132±.120 | 0.709±.022 | 0.356±.042 | 0.376±.034 | 0.192±.019 |
| BC-ViT | 0.513±.069 | 0.171±.065 | 0.543±.268 | 0.077±.134 | 0.661±.024 | 0.319±.022 | 0.316±.062 | 0.204±.063 |
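For concreteness, both metrics can be computed from a matrix of per-task success rates recorded during sequential training. The sketch below assumes the standard backward-transfer definition of Lopez-Paz & Ranzato, with NBT taken as its negation so that positive values indicate forgetting; the matrix `R` is illustrative, not data from the tables above.

```python
def avg_success_rate(R):
    """Mean success rate across all tasks after training on the full sequence.
    R[i][j] is the success rate on task j after training through task i."""
    T = len(R)
    return sum(R[T - 1]) / T


def negative_backward_transfer(R):
    """NBT = -BWT, where BWT averages (final SR - SR right after learning
    the task) over all but the last task. Positive NBT means forgetting."""
    T = len(R)
    bwt = sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)
    return -bwt


# Illustrative 3-task run: task 0 degrades slightly, task 1 is retained.
R = [
    [0.9, 0.0, 0.0],
    [0.8, 0.9, 0.0],
    [0.7, 0.9, 0.8],
]
```

With this matrix, `avg_success_rate(R)` is 0.8 and `negative_backward_transfer(R)` is 0.1, i.e. an average drop of 0.1 success rate on earlier tasks.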
We find that knowledge is compartmentalized across VLA components: the vision-language (VL) backbone is the dominant source of forgetting, while the action head retains most of its prior-task knowledge.
In the finetuning experiments, Pi0 reaches peak performance within a small fraction of its training steps, significantly faster than its initial training. In contrast, BC-Transformer requires roughly as many steps as initial training to regain peak performance, suggesting that its task knowledge has been largely erased.
In terms of recovery efficiency, Pi0 consistently recovers peak performance with fewer than 10% of the original training steps, whereas BC-Transformer often requires a comparable or greater number. The table below reports recovery steps as a fraction of the original training steps (lower is better):
| Benchmark | Pi0 | BC-Transformer |
|---|---|---|
| LIBERO-Spatial | 0.066 | 1.36 |
| LIBERO-10 | 0.105 | 1.87 |
| LIBERO-Object | 0.067 | 1.80 |
| LIBERO-Goal | 0.062 | 0.33 |
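Ratios like those above can be read off a finetuning success-rate curve. A minimal sketch, where `steps_to_recover`, the `tol` tolerance, and the example curve are hypothetical conveniences rather than the paper's exact evaluation protocol:

```python
def steps_to_recover(curve, peak, tol=0.0):
    """First step at which the finetuning success-rate curve regains the
    original peak (within tol); None if it never does. `curve` is a list
    of (step, success_rate) pairs in increasing step order."""
    for step, sr in curve:
        if sr >= peak - tol:
            return step
    return None


def recovery_ratio(curve, peak, original_steps, tol=0.0):
    """Recovery steps as a fraction of the original training steps."""
    s = steps_to_recover(curve, peak, tol)
    return None if s is None else s / original_steps
```

For example, a policy that regains its 0.8 peak at step 300 of finetuning, after 5,000 original training steps, has a recovery ratio of 0.06, in the range reported for Pi0.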